(54) Title: STATISTICAL ALGORITHMS FOR FOLDING AND TARGET ACCESSIBILITY PREDICTION AND DESIGN OF NUCLEIC ACIDS

(57) Abstract: A method of predicting structural characteristics of a nucleic acid molecule. A method of predicting single-stranded regions in the secondary structure of a nucleic acid molecule in accordance with a probability distribution of structures based on recursively generated partition functions for the identification of accessible sites on target RNA for gene down-regulation and the rational design of antisense oligos, trans-cleaving ribozymes, siRNAs and antisense RNAs, for interaction with other RNA-targeting molecules, and for rational design of nucleic acid probes such as molecular beacons for RNA or DNA targets.
TITLE OF THE INVENTION

STATISTICAL ALGORITHMS FOR FOLDING AND TARGET
ACCESSIBILITY PREDICTION AND DESIGN OF NUCLEIC ACIDS

RELATED APPLICATIONS/INCORPORATION BY REFERENCE

This application claims priority to U.S. provisional application Serial No. 60/352,643, filed January 29, 2002, incorporated herein by reference.

Indeed, each of the applications and patents cited in this text, as well as each document or reference cited in each of the applications and patents (including during the prosecution of each issued patent; "application cited documents"), and each of the PCT and foreign applications or patents corresponding to and/or claiming priority from any of these applications and patents, and each of the documents cited or referenced in each of the application cited documents, are hereby expressly incorporated herein by reference. More generally, documents or references are cited in this text, either in a Reference List before the claims, or in the text itself; and, each of these documents or references ("herein-cited references"), as well as each document or reference cited in each of the herein-cited references (including any manufacturer's specifications, instructions, etc.), is hereby expressly incorporated herein by reference.

FIELD OF THE INVENTION 
The present invention relates to statistical algorithms for predicting structural characteristics of nucleic acid molecules and target accessibility prediction for the rational design of antisense nucleic acids, for evaluating molecular interactions, and for design of nucleic acid probes.

BACKGROUND OF THE INVENTION 
Efficient gene down-regulation methods are of paramount importance for high-throughput functional studies of genes and gene products in humans and model organisms, as well as in infectious pathogens, and for the validation of new therapeutic targets and agents for the treatment of human diseases. Antisense oligonucleotides (oligos) and trans-cleaving ribozymes have been widely used for inhibition of gene expression in both prokaryotes and eukaryotes. It has been recently shown that short interfering RNAs (siRNAs) can also induce gene silencing in mammalian cells through a process known as RNA interference (RNAi). Together, these RNA-targeting molecules have emerged as increasingly important tools for gene modulation. For these antisense nucleic acid molecules to be effective, they must first bind to target messenger RNA (mRNA) or viral RNA in a sequence-specific manner, through complementary base pairing. To a large extent, target accessibility is determined by the secondary structure of the target RNA.

Experimental approaches for accessibility evaluation are laborious, time consuming, and expensive. As a result, computational methods for accessibility prediction have been in development.



With respect to accessible site identifying and targeting methods, reference is made to the following:

U.S. Patent No. 5,780,610 ("the '610 patent") issued to Collins et al. is directed toward a method for substantially reducing background signals encountered in nucleic acid hybridization assays. The method is premised on the elimination or significant reduction of the phenomenon of non-specific hybridization, so as to provide a detectable signal which is produced only in the presence of the target polynucleotide of interest. As applied to the construction of hybridizing oligonucleotides for antisense compounds, the '610 patent describes the use of short regions of hybridization between multiple probes and a target to reduce nonspecific hybridization with non-target species that results from using conventional antisense molecules.

U.S. Patent Nos. 5,856,103 and 6,183,966 issued to Gray et al. relate to a system and method for assessing the minimum of RNA:DNA sequence combinations whose properties need to be determined for selecting antisense oligonucleotide sequences that will form the most stable hybrid among all those possible in a given target mRNA sequence. The method further comprises a data processing system for identifying nucleic acid sequences for antisense oligonucleotide targeting. The method uses a control computer that includes a nearest-neighbor nucleic acid pair value data list. The nearest-neighbor nucleic acid pair value data list is determined by referring to a set of predetermined nucleic acid nearest-neighbor bond comparisons. The thermodynamic energies needed for splitting a combination of nearest-neighbor base pairs apart are used to determine the ranking of the nearest-neighbor nucleic acid pairs, and thus the sequence of priority in which the location of antisense pairing is sought. A target sequence is then received by the computer and analyzed. The computer program uses combinations of nearest-neighbor base pair stabilities, rather than relying on assignments of individual nearest-neighbor base pair stabilities.

Each of these references provides accessible site identifying and targeting features. However, it has been found desirable to be able to determine specific structural characteristics of a target RNA molecule for improved accessible site identification and targeting.

With respect to techniques for determining structural characteristics of an RNA molecule, reference is made to the following:

Zuker, M., On finding all suboptimal foldings of an RNA molecule. Science 244, 48-52 (1989); Zuker, M., The use of dynamic programming algorithms in RNA secondary structure prediction. In Waterman, M.S. (Ed.), Mathematical Methods for DNA Sequences, CRC Press, Boca Raton, FL, pp. 159-184 (1989); and Zuker, M. and Stiegler, P., Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Res. 9, 133-148 (1981) describe the so-called mfold algorithm, developed with dynamic programming algorithms, that predicts optimal folding through free energy minimization and presents suboptimal foldings.

These suboptimal foldings have limitations due to algorithmic design, and they do not guarantee a statistically valid sample of probable structures.

Wuchty, S., Fontana, W., Hofacker, I.L., Schuster, P., Complete suboptimal folding of RNA and the stability of secondary structures. Biopolymers 49 (2), 145-65 (1999) proposes complete enumeration of a large number of all possible structures with free energies within a threshold of the global minimum.

This approach is computationally prohibitive for sequences of even moderate length.

McCaskill, J.S., The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers 29, 1105-1119 (1990) includes a probabilistic algorithm using the partition function approach that computes base pairing probability and the binding probability for any base. A C program for this algorithm is available in a suite of RNA secondary structure software known as the Vienna RNA package. This package was developed by a theoretical chemistry group at the University of Vienna (Hofacker et al., www.tbi.univie.ac.at/~ivo/RNA/).

However, the algorithm does not generate any secondary structures.

As such, there is a need for an efficient and statistically unbiased method of predicting structural characteristics of an RNA molecule, in particular an mRNA or viral RNA molecule for antisense nucleic acid applications.

The following are hereby incorporated by reference:

Allawi, H.T., Dong, F., Ip, H.S., Neri, B.P., Lyamichev, V.I. (2001) Mapping of RNA accessible sites by extension of random oligonucleotide libraries with reverse transcriptase. RNA 7, 314-27.

Altuvia, S., Kornitzer, D., Teff, D., Oppenheim, A.B. (1989) Alternative mRNA structures of the cIII gene of bacteriophage lambda determine the rate of its translation initiation. J Mol Biol. 210, 265-80.

Ambros, V. (2001) microRNAs: tiny regulators with great potential. Cell 107, 823-6.

Asano, K., Niimi, T., Yokoyama, S., and Mizobuchi, K. (1998) Structural basis for binding of the plasmid ColIb-P9 antisense Inc RNA to its target RNA with the 5'-rUUGGCG-3' motif in the loop sequence. J Biol Chem. 273, 11826-11838.

Bennett, C.F., Cowsert, L.M. (1999) Antisense oligonucleotides as a tool for gene functionalization and target validation. Biochim Biophys Acta 1489 (1), 19-30.

Berzal-Herranz, A., Joseph, S., Chowrira, B.M., Butcher, S.E., Burke, J.M. (1993) Essential nucleotide sequences and secondary structure elements of the hairpin ribozyme. EMBO J. 12, 2567-73.

Bonhoeffer, S., McCaskill, J.S., Stadler, P.F., Schuster, P. (1993) Eur. Biophys. J. 22, 13-24.

Brookes, A.J. (1999) The essence of SNPs. Gene 234 (2), 177-86.

Brower, V. (1998) Genome II: the next frontier. Nat Biotechnol. 16 (11), 1004.

Brown, J.W. (1999) The Ribonuclease P Database. Nucleic Acids Res. 27, 314.

Brown, P.O., Botstein, D. (1999) Exploring the new world of the genome with DNA microarrays. Nat Genet. 21 (1 Suppl), 33-7.

Burgess, T.L., Fisher, E.F., Ross, S.L., Bready, J.V., Qian, Y.X., Bayewitch, L.A., Cohen, A.M., Herrera, C.J., Hu, S.S., Kramer, T.B., et al. (1995) The antiproliferative activity of c-myb and c-myc antisense oligonucleotides in smooth muscle cells is caused by a nonantisense mechanism. Proc. Natl. Acad. Sci. U S A. 92 (9), 4051-5.

Cazenave, C., Loreau, N., Thuong, N.T., Toulme, J.J., and Helene, C. (1987) Enzymatic amplification of translation inhibition of rabbit beta-globin mRNA mediated by anti-messenger oligodeoxynucleotides covalently linked to intercalating

Cech, T.R., Zaug, A.J., Grabowski, P.J. (1981) In vitro splicing of the ribosomal RNA precursor of Tetrahymena: involvement of a guanosine nucleotide in the excision of the intervening sequence. Cell 27, 487-96.

Cech, T.R., Damberger, S.H., Gutell, R.R. (1994) Representation of the secondary and tertiary structure of group I introns. Nat Struct. Biol. 1, 273-80.

Christoffersen, R.E., McSwiggen, J.A., Konings, D. (1994) Application of

OBJECTS AND SUMMARY OF THE INVENTION

The present invention was made in consideration of the above problem and may have as an object the provision of an efficient and statistically unbiased method of predicting structural characteristics of a nucleic acid molecule.

Another object of the invention can be to provide a method of predicting structural characteristics of an RNA molecule for identifying accessible sites for targeting by antisense nucleic acids (antisense oligos, trans-cleaving ribozymes, short interfering RNAs (siRNAs), and antisense RNAs), for predicting molecular interactions, and for design of nucleic acid probes.

Other objects and advantages of the invention may in part be obvious and may in part be apparent from the specification and the drawings.

To address the above-described problems and objects, a novel RNA folding algorithm is provided. The algorithm has been shown to offer substantial improvement for predicting single-stranded regions in RNA secondary structure. These unstructured regions are important for binding by antisense nucleic acids. Thus, use of the algorithm in methods and computer systems implementing such methods can offer an improvement in predicting single-stranded regions in RNA secondary structure; and predicting single-stranded regions in RNA secondary structure is useful in antisense, ribozyme and RNAi techniques and other applications, e.g., as discussed herein and in documents incorporated herein by reference.

In accordance with an embodiment of the invention, a computer system (say, a general purpose computer), which may include a processor, may be used for executing a number of system interface and statistical analysis instructions (e.g., software applications), which may include an embodiment of the algorithm of the present invention. The system may further include an interface for receiving sequence information (from, say, a memory device storing fragments for sampling, user input, a sequencing apparatus, and the like) and outputting structural information, a programming interface for programming new models (e.g., targeting criteria) and functionality, and the like. The system may also be part of any integrated system for secondary structure and/or target accessibility prediction, antisense nucleic acid design, nucleic acid probe design, and the like.

The statistical sampling algorithm for RNA secondary structure prediction according to an embodiment of the invention generates a statistically representative sample of probable structures according to the Boltzmann probabilities of RNA secondary structures:

$$P(I) = \frac{1}{U} \exp\left[-\frac{E(S,I)}{RT}\right]$$

where S is an RNA sequence, I is a secondary structure, E(S,I) is the free energy of the structure for the sequence, R is the gas constant, T is the absolute temperature, and U is the partition function for all admissible secondary structures of an RNA sequence, i.e., $U = \sum_{I} \exp[-E(S,I)/RT]$. Sampling from the Boltzmann distribution is desirable, because it provides a complete statistical characterization of the ensemble of probable structures. However, because there are an exponential number of possible structures for an RNA sequence, the usual statistical sampling from a discrete probability distribution is not feasible. The solution is to employ a recursive algorithm. The algorithm in accordance with an embodiment of the invention consists of two steps: in the forward step, partition functions are computed; in the sampling step, sampling probabilities are computed and a sample of structures is generated. The improvements in structure predictions and important features previously unavailable are demonstrated below.
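
The two-step scheme described above (a forward step that fills partition functions, then a sampling step that draws structures with the resulting conditional probabilities) can be illustrated with a minimal sketch. This sketch is not the energy model or the recursions of the invention: it assumes a toy model in which every base pair contributes the same free energy, loops carry no penalty, and a hairpin loop must contain at least three unpaired bases. The names `partition_table`, `sample_structure`, `PAIR_ENERGY` and `MIN_LOOP` are hypothetical and chosen only for illustration.

```python
import math
import random

RT = 0.6163          # kcal/mol at 37 C (R = 1.987e-3 kcal/(mol K), T = 310.15 K)
PAIR_ENERGY = -2.0   # ASSUMPTION: flat free energy per base pair (toy model)
MIN_LOOP = 3         # ASSUMPTION: minimum unpaired bases enclosed by a pair
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
W_PAIR = math.exp(-PAIR_ENERGY / RT)  # Boltzmann weight of one base pair


def q(Q, i, j):
    """Partition function of seq[i..j]; an empty fragment has weight 1."""
    return Q[i][j] if i <= j else 1.0


def partition_table(seq):
    """Forward step: fill Q[i][j] for every fragment seq[i..j] (inclusive)."""
    n = len(seq)
    Q = [[1.0] * n for _ in range(n)]
    for d in range(1, n):                      # fragment span j - i
        for i in range(n - d):
            j = i + d
            total = q(Q, i, j - 1)             # case: base j is unpaired
            for k in range(i, j - MIN_LOOP):   # case: base j pairs with base k
                if (seq[k], seq[j]) in PAIRS:
                    total += q(Q, i, k - 1) * W_PAIR * q(Q, k + 1, j - 1)
            Q[i][j] = total
    return Q


def sample_structure(seq, Q, rng):
    """Sampling step: draw one dot-bracket structure from the Boltzmann ensemble."""
    structure = ["."] * len(seq)
    stack = [(0, len(seq) - 1)]
    while stack:
        i, j = stack.pop()
        if j - i <= MIN_LOOP:                  # fragment too short to hold a pair
            continue
        r = rng.random() * q(Q, i, j)
        r -= q(Q, i, j - 1)
        if r < 0:                              # base j left unpaired
            stack.append((i, j - 1))
            continue
        for k in range(i, j - MIN_LOOP):
            if (seq[k], seq[j]) in PAIRS:
                r -= q(Q, i, k - 1) * W_PAIR * q(Q, k + 1, j - 1)
                if r < 0:                      # base j pairs with base k
                    structure[k], structure[j] = "(", ")"
                    stack.append((i, k - 1))
                    stack.append((k + 1, j - 1))
                    break
        else:
            stack.append((i, j - 1))           # round-off fallback: j unpaired
    return "".join(structure)


if __name__ == "__main__":
    seq = "GGGAAAUCCC"
    table = partition_table(seq)
    generator = random.Random(0)
    for _ in range(5):
        print(sample_structure(seq, table, generator))
```

Because each fragment is decomposed into mutually exclusive cases whose weights sum to its partition function, every structure is drawn with probability equal to its Boltzmann weight divided by U, without enumerating the ensemble. A realistic implementation would replace the flat pair energy with loop-dependent free energy parameters.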

Probability Profiling for Prediction of Accessibility for Targeting by Antisense Nucleic Acids

For target accessibility evaluation, it is important to predict the chance that a segment of consecutive bases is single-stranded. Several unpaired bases in a row are important for the nucleation step of hybridization, which establishes stable stacking necessary for hybridization elongation. This need is addressed by extending a sampling algorithm in accordance with an embodiment of the invention for the construction of a probability profile for a target RNA molecule. There are several advantages to the profile approach to target accessibility prediction. There is a significant correlation between hybridization potential predicted by the probability profile and the degree of translation inhibition. In contrast, there is a lack of correlation with the minimum free energy structure (e.g., computed by mfold), and also a lack of correlation with previously proposed ad hoc thermodynamic indices. In designing antisense oligonucleotides using mfold, a practical problem is how to select a secondary structure for the target RNA from the optimal structure(s) and many suboptimal structures with similar free energies. By summarizing the information from a statistical sample of probable secondary structures in a single plot according to an embodiment of the invention, the probability profile not only presents a solution for this dilemma, but also reveals "well determined" single-stranded regions through the rigorous assignment of probabilities as measures of confidence in predictions.
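
As a minimal sketch of how such a profile might be estimated from a sample, the function below takes sampled structures in dot-bracket notation and, for each position, reports the fraction of structures in which that base and the following W-1 bases are all unpaired. The dot-bracket input format and the name `probability_profile` are assumptions made for illustration; the text does not prescribe a data representation.

```python
def probability_profile(structures, width=4):
    """Estimate, for each start position i, the probability that bases
    i .. i+width-1 are all single-stranded, as the sample frequency.

    structures: list of equal-length dot-bracket strings ('.' = unpaired).
    Returns a list of length n - width + 1 (empty if the sequence is shorter).
    """
    if not structures:
        return []
    n = len(structures[0])
    windows = max(n - width + 1, 0)
    counts = [0] * windows
    unpaired = "." * width
    for s in structures:
        for i in range(windows):
            if s[i:i + width] == unpaired:
                counts[i] += 1
    return [c / len(structures) for c in counts]


if __name__ == "__main__":
    sample = ["((...))..", "(.....)..", "........."]
    print(probability_profile(sample, width=4))
```

With width=1 the same function gives the per-nucleotide single-stranded probability; with width=4 it gives the four-base segment profile used in the figures.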

Rational Design of RNA-Targeting Therapeutics

The probability profile generated in accordance with the invention reveals regions with high potential for hybridization between the target and an antisense nucleic acid. The identification of these regions provides useful input for the rational design of potent antisense oligos, trans-cleaving ribozymes and siRNAs as RNA-targeting therapeutics. The probability profile approach offers a comprehensive computational screening for the entire mRNA or viral RNA. For several mRNA sequences with length ranging from 1 kb to 3 kb, fifteen to twenty high hybridization sites per kb have been observed. These sites provide ample opportunities for the design and testing for potent antisense nucleic acids. This could be useful for the development of RNA-targeting therapeutics.

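
One simple way to turn a profile into candidate sites is to collect runs of consecutive windows whose probability meets a cutoff. The sketch below does this; the cutoff of 0.5 and the name `accessible_sites` are assumptions for illustration only, since the text does not state the criterion used to count high hybridization sites.

```python
def accessible_sites(profile, threshold=0.5, width=4):
    """Group consecutive profile windows with probability >= threshold.

    profile[i] is the probability that bases i .. i+width-1 are unpaired.
    Returns a list of (first_base, last_base) 0-based inclusive ranges,
    one per run of qualifying windows. The threshold is a hypothetical cutoff.
    """
    sites = []
    start = None
    for i, p in enumerate(profile):
        if p >= threshold:
            if start is None:
                start = i                       # a new run of windows begins
        elif start is not None:
            sites.append((start, i - 1 + width - 1))
            start = None
    if start is not None:                       # run extends to the last window
        sites.append((start, len(profile) - 1 + width - 1))
    return sites


if __name__ == "__main__":
    demo = [0.1, 0.8, 0.9, 0.2, 0.1, 0.1, 0.1, 0.1, 0.6]
    print(accessible_sites(demo))
```

Dividing the number of returned sites by the sequence length in kb gives a sites-per-kb figure comparable in form to the one quoted above, although the value depends entirely on the chosen cutoff.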
Functional Genomics and Drug Target Validation

The completion of the sequencing of the human genome signals the dawn of a new era in biomedical research. Of the estimated 30,000 to 40,000 genes in the human genome, definitive functions have been assigned to only a few percent. Functional genomics is concerned with the determination of biological functions for all of the genes and their protein products on a genome-wide scale. Inactivation of a gene is the classical approach to assign a function to a gene in higher organisms. In the post-genomic era, however, gene knockout and mutagenesis, the traditional "gold standard" tools, can no longer keep pace with new sequence information rapidly accumulated from various genome projects. Therefore, antisense nucleic acids that target mRNA have emerged as attractive reverse genetic tools for high throughput functional genomics.

Thousands of new potential therapeutic targets have emerged from human genome sequencing. The selection and validation of molecular targets may be very useful for drug development in the new millennium. Antisense nucleic acids are useful tools for the validation of human therapeutic targets by means of gene modulation.

High Throughput Applications

DNA expression arrays have emerged as major high-throughput experimental tools in the post-genomic era. DNA expression arrays can provide important clues to gene function through statistical clustering analysis. Gene expression data tend to organize genes into functional categories. Genes with unknown function can be assigned tentative functions or a role in a biological process based on the known function of genes in the same cluster. Single-nucleotide polymorphism ("SNP") databases enable studies of the association between a SNP and the risk of a disease or drug response. These associations are valuable for the identification of candidate genes for disease phenotypes.

The eventual determination of the functions of the candidate genes and confirmation of gene functional predictions based on analysis of DNA expression arrays will require experimental analysis in a systematic and high throughput fashion to keep pace with the fast-growing genome, expression array and SNP databases. Antisense nucleic acids are well suited for this endeavor. Expression array and SNP databases can provide the basis for high throughput applications to functional genomics and drug target validation.

The invention accordingly comprises the several steps and the relation of one or more of such steps with respect to each of the others, and the apparatus embodying features of construction, combination(s) of elements and arrangement of parts that are adapted to effect such steps, all as exemplified in the following detailed disclosure, and the scope of the invention may be indicated in the claims.

It is noted that in this disclosure, terms such as "comprises", "comprised", "comprising" and the like can have the meaning attributed to them in U.S. Patent law; e.g., they can mean "includes", "included", "including" and the like.

These and other embodiments are disclosed or are obvious from and encompassed by, the following Detailed Description.

BRIEF DESCRIPTION OF THE DRAWINGS

The following Detailed Description, given by way of example, but not intended to limit the invention to specific embodiments described, may be understood in conjunction with the accompanying Figures, incorporated herein by reference, in which:

Fig. 1 is a diagram illustrating a system configuration 100 in accordance with an embodiment of the invention.

Fig. 2 is a diagram illustrating types of structural elements in RNA secondary structure: helix, hairpin loop, bulge loop, interior (internal) loop and multi-branched loop.

Fig. 3 is a secondary structural diagram for the minimum free energy structure of Xlo 5S rRNA and types of structural elements: helix (formed by stacked base pairs), bulge (B loop), interior loop (I loop), hairpin loop (H loop), and multi-branched loop (M loop).

Fig. 4 is a diagram illustrating mutually exclusive cases in the derivation of recursion for u(i,j) (equation (4)), by considering a fragment R_ij being single stranded or the base pair r_h-r_l closest to the 5' end of the fragment (i.e., the first (h-i) bases are single stranded): (a) R_ij is single stranded; (b) h=i, l=j; (c) i<h<l=j; (d) h=i<l<j; (e) i<h<l<j.

Fig. 5 is a flow chart diagram illustrating an algorithm for sampling an RNA secondary structure in accordance with an embodiment of the invention, where, for (i, j, I) for the fragment from the ith to the jth base, I=1 if it is known the ends form a pair, and I=0 if this pair is unknown.

Fig. 6 is a table (Table 1) demonstrating that the algorithm samples secondary structures exactly and rigorously from the Boltzmann equilibrium probability distribution (equation (1)).

Fig. 7 is a table (Table 2) demonstrating the fast sampling step of the algorithm.

Figs. 8A, 8B, and 8C are two-dimensional histograms (2Dhist) for classes 1A, 1B and 1C for L. collosoma SL RNA. The 2Dhist shows the frequencies of base pairs on the upper left triangle with nucleotide position on both axes. Within each histogram, the sizes of the solid squares are proportional to the frequencies of the base pairs.

Figs. 9A and 9B are two-dimensional histograms for the classes 2A and 2B for L. collosoma SL RNA.

Figs. 10A, 10B and 10C are diagrams illustrating the representative structures for classes 1A, 1B and 1C for L. collosoma SL RNA based on an algorithm in accordance with an embodiment of the present invention, where Fig. 10A shows structure form 1 for class 1A, Fig. 10B shows the optimal folding by mfold for class 1B, and Fig. 10C shows the representative for class 1C.

Figs. 11A and 11B are diagrams illustrating the representative structures for classes 2A and 2B for L. collosoma SL RNA based on an algorithm in accordance with an embodiment of the present invention, where Fig. 11A shows structure form 2 for class 2A and Fig. 11B shows the representative for class 2B.

Fig. 12 is a table (Table 3) listing the classification, representation, and statistical characterization of the probable secondary structures of the Boltzmann ensemble for L. collosoma SL RNA by the examination of a statistical sample of 1,000 secondary structures based on an algorithm in accordance with an embodiment of the present invention.

Fig. 13 is a bar plot for comparing the probability (estimated by the frequency in a sample) of a class (white boxed bar) with the Boltzmann probability (black bar) for the representative structure of a class. Classes are from the structure classification for L. collosoma SL RNA.

Figs. 14A, 14B and 14C are diagrams of alternative structures for cIII mRNA. The initiation codon and the Shine-Dalgarno sequence are aW and Ur^AAGGAGr'. The substructure from the 5' end to nucleotide A(-9) is the same for structure A and structure B, where Fig. 14A shows experimental structure A, Fig. 14B shows experimental structure B, and Fig. 14C shows structure C representing a modification of B by an additional short helix involving a part of the Shine-Dalgarno sequence.

Fig. 15 is a table (Table 4) listing probability estimates of structural motifs for cIII mRNA from a sample of 100 structures based on an algorithm in accordance with an embodiment of the present invention.

Figs. 16A, 16B and 16C are diagrams illustrating the free energy distributions of sampled structures for L. collosoma SL RNA, where Fig. 16A illustrates the Boltzmann-probability-weighted density of states (BPWDOS), Fig. 16B displays the distribution for the probability that the free energy of a structure is within a threshold of global minimum, and Fig. 16C displays the distribution for the probability that the free energy of a structure is within an energy interval.

Figs. 17A, 17B and 17C are diagrams illustrating the free energy distributions of sampled structures for E. coli RNase P (377 nt), where Fig. 17A illustrates the Boltzmann-probability-weighted density of states (BPWDOS), Fig. 17B displays the distribution for the probability that the free energy of a structure is within a threshold of global minimum, and Fig. 17C displays the distribution for the probability that the free energy of a structure is in an energy interval.

Figs. 18A and 18B are diagrams illustrating probability profiles for Escherichia coli ("E. coli") tRNA^*, with sampling estimates computed from 1,000 sampled secondary structures based on an algorithm in accordance with an embodiment of the invention, where Fig. 18A shows the probability profiles for single-stranded nucleotides (segment width W=1) indicated by the phylogenetic structure (large dots) and by the minimum free energy structure (vertical bars), estimated by the sampling algorithm (short dashed line), and computed by the Vienna RNA package (long dashed line) (For the region between and the sampling estimate predicts the phylogenetic structure substantially better than the Vienna RNA package), and Fig. 18B shows the probability profiles for single-stranded sequences of four consecutive nucleotides (segment width W=4) in E. coli tRNA^* indicated by the phylogenetic structure (large dots) and by the minimum free energy structure (vertical bars), and estimated by the sampling algorithm (dashed line). (The probability profile cannot be computed by the Vienna RNA package or other existing algorithms.)

Figs. 19A, 19B, 19C and 19D are diagrams illustrating probability profiles (segment width W=4) for other representative RNA sequences, with sampling estimates computed from 1,000 sampled secondary structures based on an algorithm in accordance with an embodiment of the invention, where Fig. 19A presents the profile indicated by the phylogenetic structure (the large dots), the sampling estimate (the dashed line), and the minimum free energy structure (vertical bars) for Xenopus laevis oocyte 5S rRNA, Fig. 19B presents the profile for E. coli 16S rRNA domain II, Fig. 19C presents the profile for E. coli RNase P, and Fig. 19D presents the profile for Group I intron from 26S rRNA of Tetrahymena thermophila. The small solid squares (adjacent squares appear to form line segments) present the profile indicated by phylogenetic structure, the dashed line is the sampling estimate, and the vertical bars represent the minimum free energy structure. For the Tetrahymena Group I intron, a six base pair double-stranded region called P3 in the phylogenetic structure is not considered here because of the creation of a pseudoknot. The current sampling algorithm may be extended to predict certain types of pseudoknots.

Fig. 20 is a table (Table 5) showing a correspondence between phylogenetically determined single-stranded regions and peaks on the probability profile based on an algorithm in accordance with an embodiment of the present invention and improvement in predictions over minimum free energy structure.

Figs. 21A and 21B are diagrams illustrating contrasting predictions by probability profile (W=4) and mfold MFE structure for the nt 1-60 region and the nt 1262-1322 region of the mRNA for Homo sapiens gamma-glutamyl hydrolase (GenBank Accession No. U55206, with 66 additional nucleotides at the 5' end).

Fig. 22 is a table (Table 6) showing a comparison of inhibition of rabbit β-globin synthesis in cell-free translation systems and hybridization potential predicted by probability profile for rabbit β-globin mRNA based on an algorithm in accordance with an embodiment of the present invention.

Fig. 23 is a table (Table 7) showing a comparison of the intensity of ASO-mRNA hybridization on the oligodeoxynucleotide array and the probability profile for the first 122 bases of rabbit β-globin mRNA based on an algorithm in accordance with an embodiment of the present invention.

Fig. 24 is a diagram illustrating the probability profile for single-stranded sequences of four consecutive nucleotides (segment width W=4) estimated by 1,000 sampled secondary structures (dashed line) based on an algorithm in accordance with an embodiment of the present invention and the profile indicated by the minimum free energy structure (vertical bars) for rabbit β-globin mRNA and the experimentally measured inhibition of antisense oligomers (ASOs) in cell-free translation systems. (The profile is shown for the region of the first 230 nucleotides that is targeted by the ASOs. The length and binding sites of the ASOs are indicated by horizontal lines with the names of the ASOs centered above or below the lines. These lines also indicate the inhibition of translation through their position on the vertical axis. The vertical axis also shows the probability for the profile with inhibition corresponding to probability multiplied by 100%).

Fig. 25 is a diagram illustrating the complete probability profile for single-stranded segments of four consecutive nucleotides (segment width=4) estimated by 1,000 sampled secondary structures for E. coli lacZ mRNA.

Fig. 26 is a diagram illustrating the nt 2200 - 2400 portion of the probability profile for E. coli lacZ mRNA.

Fig. 27 is a table (Table 8) listing ten antisense oligos rationally designed by probability profiling and calculation of binding energies.

Fig. 28 is a diagram illustrating the concept of mutual accessibility for RNA:RNA interactions. The seven A bases are accessible in RNA 1, and their complementary bases, the seven Us, are also accessible in RNA 2.

Figs. 29A and 29B are diagrams illustrating a graphical method for the assessment of mutual accessibility between a target RNA and an antisense RNA or a ribozyme. For a 60-nt antisense RNA (embedded in a long RNA through an expression vector) and the targeted mRNA of Homo sapiens gamma-glutamyl hydrolase, Fig. 29A shows good mutual accessibility through the overlapping high probability region between nt 730 and nt 750 on the overlaid probability profiles (segment width W=4) at the target site. For the mRNA of the Breast Cancer Resistance Protein (BCRP) and the binding arms of a hammerhead ribozyme designed for a GUC cleavage sequence on the target, Fig. 29B shows fairly good mutual accessibility through the overlapping high probability segments formed by nucleotides 1889, 1890, 1891 and 1892 for the 3' binding arm and by nucleotides 1905, 1906, 1907 and 1908 for the 5' binding arm, respectively (segment width W=1 for the overlaid probability profiles). Mutual accessibility for a segment of at least four consecutive bases may be necessary for antisense nucleation.

Figs. 30A and 30B are diagrams illustrating models of secondary structures
and sequence requirements for the hammerhead ribozyme with targeting sequence
NUH↓ and the hairpin ribozyme with targeting sequence BN↓GUC, where N=A, C,
G or U; H=A, C or U; N' and H' are the nucleotides complementary to N and H; Y is a
pyrimidine nucleotide (U or C); R is a purine (A or G); B=C, U or G; V=G, A or C;
bold letters indicate invariant nucleotides; the arrow indicates the site of cleavage.

Fig. 31 is a diagram illustrating the probability profile for exon 3 (nt 1003-
1119) of human estrogen receptor 1 (ESR1) mRNA (6450 nt, GenBank Accession
No. NM_000125).

Fig. 32 is a table (Table 9) of siRNAs rationally designed with probability
profiling to target AA(N19) motifs in exon 3 of the human estrogen receptor 1
(ESR1) mRNA.

Fig. 33 is a diagram illustrating a high throughput framework for functional
genomics, drug target validation, and elucidation of genetic pathways. Systematic
statistical analysis of DNA expression arrays and SNP databases can provide the
basis for high throughput functional analysis. Integration of computational design of
antisense nucleic acids and experimental techniques (e.g., oligonucleotide array)
presents a rational, efficient and high throughput platform.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT
Fig. 1 is a diagram illustrating a system configuration 100 in accordance with
an embodiment of the invention. As shown in Fig. 1, system 100 may comprise a
computing device 105, which may be a general purpose computer (such as a PC),
workstation, mainframe computer system, and so forth. Computing device 105 may
include a processor device (or central processing unit "CPU") 110, a memory device
115, a storage device 120, a user interface 125, a system bus 130, and a
communication interface 135. CPU 110 may be any type of processing device for
carrying out instructions, processing data, and so forth. Memory device 115 may be
any type of memory device including any one or more of random access memory
("RAM"), read-only memory ("ROM"), Flash memory, Electrically Erasable
Programmable Read Only Memory ("EEPROM"), and so forth. Storage device 120
may be any data storage device for reading/writing from/to any removable and/or
integrated optical, magnetic, and/or optical-magneto storage medium, and the like
(e.g., a hard disk, a compact disc-read-only memory "CD-ROM", CD-ReWritable
"CD-RW", Digital Versatile Disc-ROM "DVD-ROM", DVD-RW, and so forth).
Storage device 120 may also include a controller/interface (not shown) for
connecting to system bus 130. Thus, memory device 115 and storage device 120 are
suitable for storing data as well as instructions for programmed processes for
execution on CPU 110. User interface 125 may include a touch screen, control
panel, keyboard, keypad, display or any other type of interface, which may be
connected to system bus 130 through a corresponding input/output device
interface/adapter (not shown). Communication interface 135 may be adapted to
communicate with any type of external device, system or network (not shown), such
as one or more computing devices on a local area network ("LAN"), wide area
network ("WAN"), the internet, and so forth. Interface 135 may be connected
directly to system bus 130, or may be connected through a suitable interface (not
shown).

While the above exemplary system 100 is illustrative of the basic
components of a system suitable for use with the present invention, the architecture
shown should not be considered limiting since many variations of the hardware
configuration are possible without departing from the present invention. As
described above, system 100 provides for executing processes, by itself and/or in
cooperation with one or more additional devices, that may include statistical
algorithms for prediction of secondary structure of nucleic acids and for prediction
of accessible target sites and rational design of antisense oligos, trans-cleaving
ribozymes, and siRNAs for human therapeutics and functional genomics and drug
target validation and nucleic acid probe design in accordance with the present
invention. System 100 may be programmed or instructed to perform these processes
according to any communication protocol or programming language on any platform.
Thus, the processes may be embodied in data as well as instructions stored in
memory device 115 and/or storage device 120 or received at interface 135 and/or
user interface 125 for execution on CPU 110. Exemplary processes for carrying out
the invention will now be described in detail.

Single-stranded regions in ribonucleic acid ("RNA") secondary structure are
important for RNA:DNA, RNA:RNA and RNA:protein interactions. In accordance
with an embodiment of the invention, a probability profile approach may be used for
the prediction of these regions based on a statistical algorithm for sampling RNA
secondary structures. For the prediction of phylogenetically determined
single-stranded regions in secondary structures of representative RNA sequences,
the probability profile offers substantial improvement over the minimum free energy
structure. In designing antisense nucleic acids, a practical problem is how to select a
secondary structure for the target RNA from the optimal structure(s) and many
suboptimal structures with similar free energies. By summarizing the information
from a statistical sample of probable secondary structures in a single plot, the
probability profile not only presents a solution to this dilemma, but also reveals
"well-determined" single-stranded regions through the rigorous assignment of
probabilities as measures of confidence in predictions. In antisense application to
the rabbit β-globin mRNA, a significant correlation between hybridization potential
predicted by the probability profile and the degree of inhibition of in vitro translation
suggests that the probability profile approach is valuable for the identification of
accessible target sites. Coupling computational design and experimental techniques
(e.g., oligonucleotide array) provides a rational, efficient framework for antisense
nucleic acid screening. This framework may be used for high throughput
applications to functional genomics and drug target validation.

In accordance with an embodiment of the present invention, the RNA folding
problem may be formulated in a statistical framework, and the partition function
method may be extended for generating a statistically representative sample of the
probable structures.

In accordance with an embodiment of the invention, a sampling approach for
the prediction of single-stranded regions in an RNA molecule may be used. While
the structural profile provided by the inventive approach is useful for the important
antisense nucleic acid applications, single-stranded regions, particularly
destabilizing loops, can play many important functional roles. These include, e.g.,
protein binding, ribozyme binding and catalysis, binding by siRNAs and antisense
RNAs, regulation of cellular processes, pseudoknot formation and tertiary
interactions for kissing hairpins, bulge-loop complexes, hairpin loop-internal loop
complexes, and so forth. For these applications, computational prediction of
single-stranded regions can also be helpful for the experimental design for structure
probing by ribonucleases ("RNases") or chemical means.
A regulatory mechanism has been recognized where an oligonucleotide can
bind to a messenger RNA through complementary base pairing to block its
translation. As antiviral agents, antisense oligonucleotides can inhibit replication of
RNA viruses. The discovery that oligonucleotides can play a regulatory role in gene
expression led to the development of the antisense strategy to artificially control
gene expression. Although variable degrees of success have been achieved in the
application of antisense methods to the research of biological phenomena and human
disease treatment, it has been proven that antisense oligonucleotides are able to
modulate gene expression in both prokaryotes and eukaryotes.

For antisense oligonucleotides to be effective, the complementary target
sequence on mRNA or viral RNA must be available for hybridization. RNA
nucleotides can be inaccessible when they are sequestered in secondary structure.
The usually weaker tertiary interactions and RNA-protein interactions can also be
factors that affect accessibility. The identification of regions likely to remain
single-stranded in RNA secondary structure is an important part of antisense
technology.

Target RNA structures play a significant role in determining antisense
oligonucleotide efficacy in vivo. Discovery of active antisense oligonucleotides
requires identification of unstructured regions of the target in the cellular environment. The
tightest binding of antisense oligonucleotides occurs at target sites for which
disruption of the target structure is minimal, and single-stranded regions should be
selected over double-stranded regions in the consideration of target sites. There is a
correlation between single-stranded specific probes and accessible sites for antisense
targeting, but there are a few exceptions, probably due to steric hindrance that limits
RNase H access. It has been speculated that duplex formation is initiated at an
accessible substructure that includes a site for nucleation with unpaired bases and
then propagates from the nucleation site through a "zippering" process. A hairpin of
four unpaired bases can be involved in hybrid formation.

A few secondary-structure-prediction-based computational approaches to the
evaluation of potential antisense targets have been reported. Thermodynamic
indices may be generated by averaging relevant free energies of secondary structures
generated from a Monte Carlo RNA folding algorithm based on an evolutionary
heuristic. Because this Monte Carlo algorithm does not guarantee the generation of
a valid statistical sample of low energy structures, the most likely structure identified
using this algorithm may not necessarily be the lowest free energy structure.

For the genomic RNA (about 9700 nt) and the complementary RNA strand
of the human immunodeficiency virus type 1 ("HIV-1"), local folding potential can
shed light on effective antisense targets. The local folding potential may be
computed for each of successive overlapping segments of a chosen window width
(ranging from 50 to 400 nt) along the RNA chain, by folding each segment with
mfold and computing its minimum free energy. This method may be used for
assessing stable structures in HIV-1. Because long distance interactions and short
term interactions between the nucleotides near the ends of the segment and the
neighboring nucleotides outside the segment are ignored, this method appears to be
reasonable only for relatively long window widths, as it cannot address the
hybridization potential of individual nucleotides or short sequences.

The use of only the optimal folding or limited suboptimal foldings from
mfold for antisense prediction is an inherent limitation of the method by Walton et
al. The repeated folding for folding domains introduces additional uncertainty in
predictions. Global disruption of the target structure by antisense oligos is proposed.
However, an array study suggests that a duplex can only form when hybridization
elongation requires little perturbation of the existing target structure (Mir &
Southern). This suggests antisense hybridization only disrupts local structure of the
target. Furthermore, substantial human curation appears to be necessary for this
method.

A comparative analysis using mfold on twenty-two RNAs has been
performed. The RNAs were previously studied for selective gene inactivation by
antisense oligonucleotides and ribozymes, small catalytic RNA molecules that
specifically bind to target RNAs by complementary base pairing (i.e., antisense
mechanism) and then cleave the target at specific sites. Despite limited representation of
alternative structures by four or five suboptimal foldings, the analysis found a
correlation between the predicted base-pairing accessibility of the targets and the
experimental efficacy of the antisense reagents. Thus, it has been recommended that
the cleavage site for ribozymes should fall within a loop of at least four nucleotides,
and one, preferably both, of the 5' and 3' ends of the antisense segment should fall
within a single-stranded rather than a stem region. Despite the inherent difficulty in
selecting a representative sample of the suboptimal foldings, a procedure that
addresses the hybridization potential using suboptimal foldings from mfold has been
proposed and shown to work well for the rat OX40 mRNA.
These findings lend additional support to the importance of exploring
secondary structure in the selection of antisense targets. In accordance with an
embodiment of the invention, it is desirable to focus on single-stranded regions in
RNA secondary structure, in particular those of at least four consecutive unpaired
bases. The Vienna package can calculate the probability of a single base being
unpaired; however, it cannot address the hybridization potential of a region. This is
not a problem for the sampling-based probability profile approach utilized in
accordance with the invention, which can overcome limitations of existing
computational approaches. An illustrative embodiment of the inventive approach
will now be described as applied to representative RNA sequences and an antisense
application to rabbit β-globin mRNA.

The Nobel Prize-winning discovery of RNA catalysis led to the development
of ribozyme technology for gene inhibition. Ribozymes are catalytic RNAs that
possess the dual properties of sequence-specific RNA recognition and site-specific
cleavage. In other words, they first bind to the RNA target by complementary base
pairing, and then cleave the target at a specific site. Among ribozymes discovered to
date, the hammerhead ribozyme and the hairpin ribozyme have been of greatest
interest, due to a number of significant attributes of these small ribozymes. These
attributes include site-specific cleavage, multiple turnover and the ability to be
exogenously delivered or endogenously expressed from a transcription cassette. In
addition to increased stability, ribozymes may have other potential advantages over
antisense oligos: (1) the inhibitory effect of ribozymes may include a contribution
from the antisense binding step; (2) ribozyme binding to the target is more stringent;
and (3) their specificity is higher due to their dual properties of sequence-specific
binding and site-specific target cleavage. The trans-cleavage ability makes
hammerhead and hairpin ribozymes important tools in the elucidation of the function
of new genes predicted from genome sequencing projects, in the development of
antiviral agents for therapeutic applications, and in the validation of drug targets.

For antisense oligos and trans-cleaving ribozymes, it is well understood that
the accessibility of the target site is among the most important factors for their
intracellular efficacies. There is compelling experimental and computational
evidence that, to a large extent, the accessibility of the target to antisense oligos or
ribozymes is constrained by the secondary structure of the target RNA. For
ribozyme design, several computational methods make accessibility predictions
based on mfold. However, these methods cannot escape the limitations inherent in
mfold.

In addition to antisense oligos and ribozymes, RNA interference (RNAi) by
double-stranded RNAs has emerged as a powerful reverse genetic tool to silence
gene expression in a wide range of eukaryotic organisms including plants,
Caenorhabditis elegans, Drosophila, and mice. The discovery that short
double-stranded siRNAs can mediate RNAi in mammalian cells has further expanded the
utility of RNAi into mammalian systems. There is experimental evidence that the
potency of siRNAs is also determined by target accessibility.
Statistical Sampling of RNA Secondary Structures

A structure sampling algorithm based on free energies for stacking in helices
may be used to yield a representative statistical sample of secondary structures, as
described in Ding. In accordance with an embodiment of the invention, the
sampling probabilities may be computed using partition functions calculated in the
forward step of the algorithm. For more sophisticated and realistic energy rules, an
extended algorithm may be used according to an embodiment of the invention. The
forward step of this algorithm may include a recursive algorithm for partition
functions. This recursive algorithm may include free energies for dangling ends and
other recent free energy parameters. The backward step may take the form of a
sampling algorithm; the sampling probabilities may be computed using the partition
functions computed in the forward step.

The extended algorithm may accommodate up-to-date free energy rules and
parameters. These include free energies for stacking in a helix, stacking for a
terminal mismatch in a hairpin loop (size ≥ 4) or an interior loop, and penalties for
hairpin, bulge, interior, and multi-branched loops. Free energies for dangling ends
may be used for exterior and multi-branched loops. For hairpins, a bonus for UU
and GA first mismatches (included in the terminal stacking data) and a bonus for
G·U closure preceded by two G nucleotides in base pairs may be applied, and a
penalty for oligo-C loops (all unpaired nucleotides are C) may be used.

The Boltzmann distribution in statistical mechanics gives the probability of a
secondary structure I for an RNA sequence S at equilibrium as
$$P(I) = \frac{1}{U}\exp\left[-E(S,I)/RT\right] \qquad (1)$$
where E(S,I) is the free energy of the structure, R is the gas constant, T is the
absolute temperature, and U is the partition function for all admissible secondary
structures of the RNA sequence, i.e., $U = \sum_{I}\exp\left[-E(S,I)/RT\right]$. The extended
algorithm samples exactly and rigorously according to the Boltzmann distribution
(1), i.e., it can generate a statistical sample of any desired size from the Boltzmann
ensemble of secondary structures. The sampling process is similar to the traceback
algorithm employed in dynamic programming algorithms but differs in that the
base pairing is randomly sampled from Boltzmann probabilities rather than chosen
to yield a minimum free energy structure. Because the probability of a structure
decreases exponentially with increasing free energy, the structure with the highest
frequency in the sample is most likely the minimum free energy structure. When
long interior loops (e.g., size > 30 nt) are disallowed, the forward step of the
algorithm is cubic. The sampling step of the algorithm is stochastically quadratic in
the worst case; thus it can quickly generate a large number of secondary structures.
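
The forward and backward steps described above can be sketched on a toy energy model. The following Python sketch is illustrative only and is not the extended algorithm of the embodiment: it assumes a hypothetical constant free energy per base pair (PAIR_ENERGY), ignores stacking, loop and dangling-end terms, assumes a value of RT corresponding to 37 °C, and uses function names invented for this example. It fills a partition function table (forward step) and then draws a structure by choosing each pairing with its Boltzmann probability rather than by energy minimization (backward step).

```python
import math
import random

RT = 0.61633          # kcal/mol, assuming T = 310.15 K (37 C)
PAIR_ENERGY = -1.0    # hypothetical free energy per base pair (kcal/mol)
MIN_LOOP = 3          # at least three intervening bases between paired bases
PAIRS = {"AU", "UA", "GC", "CG", "GU", "UG"}
PAIR_WEIGHT = math.exp(-PAIR_ENERGY / RT)


def can_pair(seq, i, j):
    """True if bases i < j may pair under the toy rules."""
    return j - i > MIN_LOOP and seq[i] + seq[j] in PAIRS


def partition_table(seq):
    """Forward step: q[i][j] is the partition function of seq[i:j]."""
    n = len(seq)
    q = [[1.0] * (n + 1) for _ in range(n + 1)]  # empty fragments weigh 1
    for length in range(1, n + 1):
        for i in range(n - length + 1):
            j = i + length
            total = q[i][j - 1]                  # last base unpaired
            for k in range(i, j - 1):            # last base pairs with k
                if can_pair(seq, k, j - 1):
                    total += q[i][k] * PAIR_WEIGHT * q[k + 1][j - 1]
            q[i][j] = total
    return q


def sample_structure(seq, q, rng):
    """Backward step: draw one dot-bracket structure from the ensemble."""
    structure = ["."] * len(seq)
    stack = [(0, len(seq))]
    while stack:
        i, j = stack.pop()
        if j - i <= MIN_LOOP + 1:                # too short to hold a pair
            continue
        r = rng.random() * q[i][j] - q[i][j - 1]
        if r < 0:                                # last base left unpaired
            stack.append((i, j - 1))
            continue
        for k in range(i, j - 1):
            if can_pair(seq, k, j - 1):
                r -= q[i][k] * PAIR_WEIGHT * q[k + 1][j - 1]
                if r < 0:
                    structure[k], structure[j - 1] = "(", ")"
                    stack.extend([(i, k), (k + 1, j - 1)])
                    break
        else:                                    # guard against rounding
            stack.append((i, j - 1))
    return "".join(structure)


rng = random.Random(0)
seq = "GGGAAAACCC"
q = partition_table(seq)
sample = [sample_structure(seq, q, rng) for _ in range(1000)]
```

In this sketch the table fill is cubic in the sequence length, and each draw reuses the stored table, which is why many structures can be sampled cheaply once the forward step is done.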
Probability Profiling for Predicting Single-Stranded Bases and Segments

From recursively derived partition functions for an RNA sequence of n
bases, recursions may be used for computing marginal base pairing probabilities. Let
$P_{ij}$ = Prob(base i and base j form a pair); then the probability that base i is
unbound, i.e., single-stranded, is $q_i = 1 - \sum_{j<i} P_{ji} - \sum_{j>i} P_{ij}$. The base pair binding
probabilities are not locally determined by the RNA sequence; rather, they reflect a
sum over all equilibrium weighted structures in which the chosen base pair occurs.
Therefore, $\{q_i\}$ statistically describe the antisense hybridization potential for every
nucleotide in the sequence. Alternatively, the sampling method presents a means to
estimate $q_i$ with the sampling frequency for the unbound base i. This avoids the
cubic algorithm required to compute the probabilities analytically. A probability
profile is then displayed by plotting $\{q_i\}$ against the nucleotide position.
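
The two routes to the single-base probabilities can be illustrated with a minimal Python sketch. The function names, the upper-triangular matrix layout, and the dot-bracket input format are assumptions made for this example and are not taken from the specification.

```python
def unpaired_from_pair_probs(pair_prob):
    """Analytical route: q_i = 1 - sum_{j<i} P_ji - sum_{j>i} P_ij.

    pair_prob[i][j] (i < j) is assumed to hold the pairing probability P_ij.
    """
    n = len(pair_prob)
    q = []
    for i in range(n):
        paired = sum(pair_prob[j][i] for j in range(i))
        paired += sum(pair_prob[i][j] for j in range(i + 1, n))
        q.append(1.0 - paired)
    return q


def unpaired_from_sample(structures):
    """Sampling route: fraction of sampled structures with base i unpaired.

    structures is assumed to be a list of equal-length dot-bracket strings.
    """
    n = len(structures[0])
    total = len(structures)
    return [sum(s[i] == "." for s in structures) / total for i in range(n)]


# Toy inputs for illustration only.
toy_pairs = [
    [0.0, 0.0, 0.0, 0.6],
    [0.0, 0.0, 0.3, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
profile_exact = unpaired_from_pair_probs(toy_pairs)
profile_sampled = unpaired_from_sample(["(..)", "....", "(())"])
```

Plotting either list against the nucleotide position would give a single-base probability profile.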

However, probabilities $\{q_i\}$ may not provide a suitable means to assess the
potential of a segment to be single-stranded and available for hybridization. More
specifically, for a fragment from base i to base j, $Q_{ij}$, the probability of the fragment
being single-stranded, may not simply be the product of individual probabilities
$\{q_m\}$, $i \le m \le j$, because independence may be invalidated by the nearest-neighbor
interactions. However, a probabilistic measure of the hybridization potential of a
segment can be obtained from a sample of secondary structures. Because the sample
is representative of the Boltzmann ensemble of secondary structures, the fraction of
the sample in which all the nucleotides in the segment are single-stranded provides
an unbiased estimate of the probability of the segment being single-stranded. For all
successive overlapping segments of width W, the sampling estimate for the
probability that a segment is single-stranded can be plotted against the first
nucleotide of the segment for a probability profile of single-stranded segments with
width W. Based on a rule of thumb of at least four unpaired bases, W may be set to
equal 4 for an antisense application.
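
The segment profile described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the sampled structures are available as equal-length dot-bracket strings; the function name and the toy sample are hypothetical.

```python
def segment_profile(structures, width=4):
    """Estimate, for each start position, the probability that the
    `width` consecutive bases beginning there are all unpaired.

    The estimate is the fraction of sampled structures in which the
    whole segment is single-stranded (all characters are '.').
    """
    n = len(structures[0])
    total = len(structures)
    open_segment = "." * width
    profile = []
    for start in range(n - width + 1):
        hits = sum(1 for s in structures if s[start:start + width] == open_segment)
        profile.append(hits / total)
    return profile


# Toy sample of four structures on an 8-nt sequence (illustration only).
toy_sample = ["((....))", "........", "(......)", ".((...))"]
toy_profile = segment_profile(toy_sample, width=4)
print(toy_profile)  # [0.25, 0.5, 0.75, 0.5, 0.25]
```

Counting whole segments per structure, rather than multiplying single-base frequencies, is what lets the estimate account for dependence between neighboring bases.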

An algorithm in accordance with an embodiment of the invention will now
be described in detail.

As mentioned before, a recursive algorithm is presented for the partition
functions of RNA secondary structures based on recent thermodynamic parameters.
A fast statistical algorithm may be used with the partition functions to generate a
statistical sample from the Boltzmann ensemble of secondary structures. The
algorithm presents a statistical solution to the dilemma that presentation of
suboptimal foldings through a designed suboptimal selection method can be limited,
and that complete enumeration and examination of all suboptimal foldings (with
free energies within a threshold of the global minimum) are difficult. By classifying
sampled structures, the algorithm enables an efficient statistical delineation and
representation of the Boltzmann ensemble. Alternative biological structures can be
revealed by a statistical sample. The sampling algorithm may be applied to
Leptomonas collosoma ("L. collosoma") Spliced Leader ("SL") RNA and mRNA of
the cIII gene of bacteriophage λ, two examples with experimentally demonstrated
alternative structures. These structures are well predicted by the sampling algorithm,
while a structure for cIII mRNA is poorly predicted by mfold as a result of its
algorithmic design for the selection of suboptimal foldings. Statistical sampling
provides a means to estimate the probability of any structural motif with or without
constraint. Furthermore, a probability profile for any specified fragment width can
be constructed for predicting single-stranded regions in RNA secondary structure.
By overlaying probability profiles, a mutual accessibility plot can be displayed for
predicting RNA:RNA interaction. The sampling approach offers an effective means
to address both the uncertainty in structure prediction and the likelihood of potential
alternative structures for long-chain RNAs. In particular, the applications show that
the sampling algorithm can be well suited to structure prediction and assessment of
target accessibility for mRNAs. In addition, Boltzmann-probability-weighted
density of states and free energy distributions of sampled structures can be readily
computed. Thus, the sampling algorithm enables important features and tools for the
characterization of the Boltzmann ensemble of RNA secondary structures. It also
provides new tools for RNA research, in particular, for the optimal target prediction
and the rational design of antisense nucleic acids for gene down-regulation.

RNA molecules play a variety of important functional roles that include
catalysis, RNA splicing, regulation of transcription, and translation. The function of
an RNA molecule is determined by its structure. However, it is extremely difficult to
crystallize large RNA molecules. To date, crystal structure has been determined only
for a few RNA molecules. Secondary structures are highly conserved in evolution
for most functional RNAs, e.g., transfer RNAs. On the other hand, RNA tertiary
structural motifs involve interactions between secondary structure elements. To a
large extent, RNA folding is driven by secondary structure features. For these
reasons, elucidation of RNA secondary structure is an important step toward
determination of RNA three-dimensional structure and function.

The characterization of the full ensemble of probable RNA secondary
structures has been of great interest, because from the perspective of statistical
mechanics, an RNA molecule can exist in an ensemble of structures. For example, a
messenger RNA (mRNA) may exist as a population of different structures. On the
other hand, multiple structures are involved in a variety of RNA regulatory
functions. These include the function of 5S RNA during protein synthesis,
regulation of translation initiation, and transcription attenuation in enteric bacteria.

Free energy minimization has been a popular method for RNA secondary
structure prediction from a single sequence. Although free energy models for
secondary structure motifs have undergone refinements for more accurate
characterization of folding thermodynamics, there is still uncertainty in the
experimental estimates of the parameters. The free energy computed for a structure
is approximate also because of the assumption of free energy additivity and the need to
extrapolate to loop sequences and loop sizes in the absence of measured estimates.
The ill conditioning of the RNA folding problem by free energy minimization has
been well noted. Furthermore, the stability of secondary structure motifs can be
affected by potential tertiary interactions that are unaccounted for in secondary
structure prediction, and little is known about thermodynamic contributions of
tertiary motifs. Hence, the minimum free energy structure from a folding algorithm
may not be the true structure, and the true structure may be a suboptimal folding.
For these reasons, it is important to fully characterize and efficiently represent the
Boltzmann ensemble of RNA secondary structures. However, existing algorithms
have only provided partial solutions for addressing the above issues.

The mathematical algorithms by Zuker predict the optimal folding and present a
designed set of suboptimal foldings within any prescribed P% ($0 \le P \le 100$) of the
global minimum. This is an efficient approach; however, it has its limitations. For
each admissible base pair, the suboptimal algorithm generates the constrained
optimal folding with this pair as the constraint. Thus it regenerates the global
optimal folding if the base pair is present in the global optimal folding. For a
sequence of n nucleotides and $n_0$ base pairs in the optimal folding, at most
$n(n-1)/2 - n_0$ suboptimal foldings are examined by this algorithm. This set is common for all
choices of P, and those within P% of the minimum free energy are returned by the
algorithm. For large P and for even moderate n, this is a small subset of all the
suboptimal foldings within P% of the minimum free energy because as P
approaches 100 the number of all suboptimal foldings increases exponentially with
n. Furthermore, if the least stable structure from this set is Q% off the global
minimum, then for $P \ge Q$, no new suboptimal foldings are produced. A structure
which is not the constrained optimal folding generated by any of its base pairs is in
the complementary set of the "missing" suboptimal foldings, i.e., the collection of
suboptimal foldings excluded by the suboptimal algorithm. For example, structures
specified by removing one or more base pairs from the optimal folding fall into this
set.
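
The gap between the quadratic bound and the exponential growth of the full structure space can be made concrete with a small Python sketch. It is an illustration under a simplifying assumption that is not part of the specification: the count treats any two bases separated by at least three intervening bases as able to pair, so it is an upper bound that ignores the actual sequence and all energies. The function names are hypothetical.

```python
def zuker_candidate_bound(n, n0):
    """Upper bound n(n-1)/2 - n0 on the suboptimal foldings examined."""
    return n * (n - 1) // 2 - n0


def count_all_structures(n, min_loop=3):
    """Count unknotted secondary structures on n bases, assuming any two
    bases with at least `min_loop` intervening bases may pair."""
    counts = [1] * (n + 1)                      # counts[0] = 1 (empty fragment)
    for length in range(1, n + 1):
        total = counts[length - 1]              # last base unpaired
        # last base paired with base k (0-indexed), k <= length - 2 - min_loop
        for k in range(0, length - 1 - min_loop):
            total += counts[k] * counts[length - 2 - k]
        counts[length] = total
    return counts[n]


for n in (10, 20, 30):
    print(n, zuker_candidate_bound(n, 0), count_all_structures(n))
```

Under this assumption the structure count grows far faster than the quadratic bound as n increases, which is consistent with the observation that the designed set is a small subset of all suboptimal foldings for large P.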

A recent mathematical algorithm by Wuchty et al. deals with the computation
of all suboptimal foldings within any specified increment of the minimum free
energy. This is a more analytical treatment than an earlier attempt. For this
algorithm, the number of suboptimal foldings and CPU time show exponential
behavior as the range of the energy interval increases. This is the result of the
exponential number of structures for an RNA sequence. For even moderate sequence
length and a relatively wide energy interval, enumeration and examination of this
huge set of suboptimal foldings become prohibitive.

The calculation of equilibrium partition functions and base pairing
probabilities is an important advance toward the characterization of the Boltzmann
ensemble of secondary structures. However, the elegant algorithm for this
calculation does not generate any secondary structure.

The dilemma that the presentation of suboptimal foldings through a designed
set can be limited and complete enumeration and examination of suboptimal
foldings are difficult appears to be impossible to solve by a mathematical treatment.
While conventional algorithms fall short of the objective of efficient and statistically
unbiased representation of suboptimal foldings, a statistical sampling approach may
not only demonstrate the optimal folding or its close resemblance, but also
efficiently summarize the suboptimal foldings and reveal potentially important
alternatives.

In accordance with an embodiment of the invention, an algorithm for
partition functions that are based on recent free energy parameters is provided. In
addition, an algorithm based on these energy parameters and the partition functions
to sample exactly and rigorously from the Boltzmann distribution is provided.
Prediction of alternative structures presents a challenging test on an algorithm
because there are two structures to be predicted. The capability of an algorithm
according to an embodiment of the present invention for predicting alternative
structures is demonstrated with applications to L. collosoma SL RNA and mRNA of
the cIII gene of bacteriophage λ, two examples with experimentally demonstrated
alternative structures. The classification of probable structures for L. collosoma SL
RNA and probability estimates of structural motifs for cIII mRNA are also
demonstrated.

Computing Partition Functions

For an RNA molecule of n ribonucleotides, the sequence from the ith
ribonucleotide from the 5' end to the jth ribonucleotide may be denoted by
$R_{ij} = r_i r_{i+1} \ldots r_j$, $1 \le i, j \le n$, where $r_i$ = A, C, G, or U. Elements of RNA
secondary structure are illustrated by Figs. 2 and 3. Let $I_{ij}$ be a secondary structure
on $R_{ij}$ that meets the usual constraints of unknotted structure and that there are at
least three intervening bases between any base pair. For structures under the
constraints, let $IP_{ij}$ be a structure on $R_{ij}$ with the ends constrained to form a base
pair. The partition functions restricted to $R_{ij}$ may be defined as:

$$u(i,j) = \sum_{I_{ij}} \exp[-E(R_{ij}, I_{ij})/RT] \tag{2}$$

$$\mathit{up}(i,j) = \sum_{IP_{ij}} \exp[-E(R_{ij}, IP_{ij})/RT] \tag{3}$$

where $E(R_{ij}, I_{ij})$ is the free energy of structure $I_{ij}$ for $R_{ij}$, R is the gas constant, T is
the absolute temperature, and (kcal/mol)/RT = 1.6225. Recursive calculation of
partition functions may be used for computing base pair probabilities. The
recursions presented below extend such calculations by including all but coaxial
stacking from recent free energy parameters. Also, the recursions are presented in a
fashion such that sampling probabilities can be readily derived.
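The constant 1.6225 quoted above can be checked numerically. The following is a minimal sketch, assuming a gas constant of 1.987204e-3 kcal/(mol K) and a temperature of 310.15 K (37°C); neither value is stated in the text, and the function names are illustrative only.

```python
import math

# Assumed constants (not stated in the text): gas constant in kcal/(mol*K)
# and 37 degrees C expressed in kelvin.
GAS_CONSTANT_KCAL = 1.987204e-3
TEMPERATURE_K = 310.15

RT = GAS_CONSTANT_KCAL * TEMPERATURE_K  # about 0.6163 kcal/mol
INV_RT = 1.0 / RT                       # about 1.6225 mol/kcal


def boltzmann_weight(free_energy_kcal):
    """Return exp(-E/RT) for a free energy E given in kcal/mol."""
    return math.exp(-free_energy_kcal * INV_RT)


def partition_function(free_energies):
    """Sum Boltzmann weights over an explicitly listed set of structures."""
    return sum(boltzmann_weight(e) for e in free_energies)
```

An explicit sum of this kind is only feasible for a handful of structures; the recursions in this section exist precisely because the number of structures grows exponentially with sequence length.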

When a single-stranded base is adjacent to two helices, it may be the case
that only the 3' dangling is considered because it is usually more energetically
favorable than 5' dangling according to the free energy data for dangling ends. The
assumed additivity of free energy implies multiplicativity of contributions by
structural elements to the partition functions. The contributions to the partition
functions by mutually exclusive conformational classes are, however, additive.
These features are important in the derivation of a recursive algorithm. As
illustrated by Fig. 4, for fragment $R_{ij}$, by considering it being single stranded or the
base pair $r_h$-$r_l$ closest to the 5' end of the fragment (i.e., the first (h-i) bases are
single stranded), the following mutually exclusive and exhaustive cases may be
included: (a) $R_{ij}$ is single stranded; (b) h = i, l = j; (c) i < h < l = j; (d) h = i < l < j;
(e) i < h < l < j. Thus, u(i,j) is a sum of five terms:

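$$
\begin{aligned}
u(i,j) = {} & 1 + \mathit{up}(i,j)\exp[-\mathit{etp}(i,j)/RT] \\
& + \sum_{i < h < j} \mathit{up}(h,j)\exp\{-[\mathit{ed5}(h,j,h-1)+\mathit{etp}(h,j)]/RT\} \\
& + \sum_{i < l < j} \mathit{up}(i,l)\exp[-\mathit{etp}(i,l)/RT]\,\{\exp[-\mathit{ed3}(i,l,l+1)/RT]\,u(l+2,j) + u(l+1,j) - u(l+2,j)\} \\
& + \sum_{i < h < l < j} \mathit{up}(h,l)\exp\{-[\mathit{ed5}(h,l,h-1)+\mathit{etp}(h,l)]/RT\}\,\{\exp[-\mathit{ed3}(h,l,l+1)/RT]\,u(l+2,j) + u(l+1,j) - u(l+2,j)\}
\end{aligned}
\tag{4}
$$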

where for base pair $r_i$-$r_j$, etp(i,j) is the terminal A-U, G-U penalty, ed5(h,l,h-1)
is the free energy for 5' dangling $r_{h-1}$ on $r_h$-$r_l$, and ed3(h,l,l+1) is the free energy for
3' dangling $r_{l+1}$ on $r_h$-$r_l$. When $r_i$ and $r_j$ form a base pair, there are the following
exclusive and exhaustive cases: (i) $r_i$-$r_j$ closes a hairpin; (ii) $r_i$-$r_j$ is the exterior pair
of a base pair stack; (iii) $r_i$-$r_j$ closes a bulge or an interior loop; (iv) $r_i$-$r_j$ closes a
multi-branched loop. Thus up(i,j) is the sum of four contributions:
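$$
\begin{aligned}
\mathit{up}(i,j) = {} & \exp[-\mathit{eh}(i,j)/RT] + \exp[-\mathit{es}(i,j,i+1,j-1)/RT]\,\mathit{up}(i+1,j-1) \\
& + \sum_{i < h < l < j} \exp[-\mathit{ebi}(i,j,h,l)/RT]\,\mathit{up}(h,l) + \mathit{up}_m(i,j)
\end{aligned}
\tag{5}
$$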
where eh(i,j), es(i,j,i+1,j-1) and ebi(i,j,h,l) are free energies for a hairpin closed by
$r_i$-$r_j$, stacking between base pairs $r_i$-$r_j$ and $r_{i+1}$-$r_{j-1}$, and a bulge or an interior loop
with exterior base pair $r_i$-$r_j$ and interior base pair $r_h$-$r_l$, respectively, and $\mathit{up}_m(i,j)$ is
the contribution from case (iv) above. For case (iv), by considering the internal
helix closest to $r_i$ with closing pair $r_h$-$r_l$, the recursion for $\mathit{up}_m(i,j)$ is
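$$
\begin{aligned}
\mathit{up}_m(i,j) = {} & \sum_{i+1 < l < j} \mathit{up}(i+1,l)\exp\{-[a+2c+\mathit{etp}(i+1,l)]/RT\} \\
& \quad \times \{\exp[-\mathit{ed3}(i+1,l,l+1)/RT]\,u1(l+2,j-1) + u1(l+1,j-1) - u1(l+2,j-1)\} \\
& + \sum_{i+2 < l < j} \mathit{up}(i+2,l)\exp\{-[a+2c+b+\mathit{ed3}(j,i,i+1)+\mathit{etp}(i+2,l)]/RT\} \\
& \quad \times \{\exp[-\mathit{ed3}(i+2,l,l+1)/RT]\,u1(l+2,j-1) + u1(l+1,j-1) - u1(l+2,j-1)\} \\
& + \sum_{i+3 \le h < l < j} \mathit{up}(h,l)\exp\{-[a+2c+(h-i-1)b+\mathit{ed3}(j,i,i+1)+\mathit{ed5}(h,l,h-1)+\mathit{etp}(h,l)]/RT\} \\
& \quad \times \{\exp[-\mathit{ed3}(h,l,l+1)/RT]\,u1(l+2,j-1) + u1(l+1,j-1) - u1(l+2,j-1)\}
\end{aligned}
\tag{6}
$$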

where a, b, c are the offset, free base penalty, and helix penalty of the assumed
linear penalty for a multi-branched loop: loop penalty = a + b × (number of unpaired
bases) + c × (number of helices); the three sums with h = i+1, h = i+2, and h ≥ i+3 are for
different cases of dangling on $r_i$-$r_j$ and $r_h$-$r_l$; and u1(h,j-1) is an auxiliary partition
function for a multi-branched loop with the properties that there is at least one helix
between $r_h$ and $r_{j-1}$, and $r_{h-1}$ is the 3' end of the previous helix in the loop and $r_j$ is the
3' end base of the closing pair $r_i$-$r_j$ for the loop. Similar to the derivation of $\mathit{up}_m(i,j)$,
we consider the closing base pair $r_h$-$r_l$ of the helix closest to the 5' end of $R_{ij}$ and
take into account both dangling energy and terminal penalty. Furthermore, we
must consider both the case of no more helix between $r_{l+1}$ and $r_j$ and the case of at
least one more helix. For u1(i,j), $r_{j+1}$ is the 3' base of the closing base pair for the
multi-branched loop, and $r_{i-1}$ is the 3' end of the previous helix in the loop. The
recursion for u1(i,j) is:

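$$
\begin{aligned}
u1(i,j) = {} & \sum_{i < l \le j} \mathit{up}(i,l)\exp\{-[c+\mathit{etp}(i,l)]/RT\} \\
& \quad \times \{f(j+1,i,l)\exp[-(j-l)b/RT] + \exp[-\mathit{ed3}(i,l,l+1)/RT]\,u1(l+2,j) + u1(l+1,j) - u1(l+2,j)\} \\
& + \sum_{i+1 < l \le j} \mathit{up}(i+1,l)\exp\{-[c+b+\mathit{etp}(i+1,l)]/RT\} \\
& \quad \times \{f(j+1,i+1,l)\exp[-(j-l)b/RT] + \exp[-\mathit{ed3}(i+1,l,l+1)/RT]\,u1(l+2,j) + u1(l+1,j) - u1(l+2,j)\} \\
& + \sum_{i+2 \le h < l \le j} \mathit{up}(h,l)\exp\{-[c+(h-i)b+\mathit{etp}(h,l)+\mathit{ed5}(h,l,h-1)]/RT\} \\
& \quad \times \{f(j+1,h,l)\exp[-(j-l)b/RT] + \exp[-\mathit{ed3}(h,l,l+1)/RT]\,u1(l+2,j) + u1(l+1,j) - u1(l+2,j)\}
\end{aligned}
\tag{7}
$$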

where f(j+1,h,l) = 1 for l = j and f(j+1,h,l) = exp[-ed3(h,l,l+1)/RT] for l < j. The
computation is $O(n^4)$ for (4), (6) and (7) as written, and is $O(n^2)$ for (5) when long
interior loops are disallowed. Three additional auxiliary arrays s1(h,j), s2(h,j) and
s3(h,j) may also be introduced:

$$
s1(h,j) = \sum_{h < l < j} \mathit{up}(h,l)\exp\{-[\mathit{ed5}(h,l,h-1)+\mathit{etp}(h,l)]/RT\}\,\{\exp[-\mathit{ed3}(h,l,l+1)/RT]\,u(l+2,j) + u(l+1,j) - u(l+2,j)\}
\tag{8}
$$

$$
s2(h,j) = \sum_{h < l < j} \mathit{up}(h,l)\exp\{-[\mathit{ed5}(h,l,h-1)+\mathit{etp}(h,l)]/RT\}\,\{\exp[-\mathit{ed3}(h,l,l+1)/RT]\,u1(l+2,j-1) + u1(l+1,j-1) - u1(l+2,j-1)\}
\tag{9}
$$

$$
s3(h,j) = \sum_{h < l \le j} \mathit{up}(h,l)\exp\{-[\mathit{ed5}(h,l,h-1)+\mathit{etp}(h,l)]/RT\}\,\{f(j+1,h,l)\exp[-(j-l)b/RT] + \exp[-\mathit{ed3}(h,l,l+1)/RT]\,u1(l+2,j) + u1(l+1,j) - u1(l+2,j)\}
\tag{10}
$$

Then the quartic sum in (4) becomes $\sum_{i < h < j-1} s1(h,j)$, the quartic sum in (6)
becomes $\sum_{i+3 \le h < j-1} \exp[-\mathit{ed3}(j,i,i+1)/RT]\,\exp[-(a+2c+(h-i-1)b)/RT]\,s2(h,j)$, and the
quartic sum in (7) becomes $\sum_{i+2 \le h < j} \exp\{-(c+(h-i)b)/RT\}\,s3(h,j)$. At the cost of storing
these arrays, the algorithm is cubic when long interior loops (e.g., size > 30) are
disregarded.

The computation may be started with boundary values for short fragments
and proceed to longer ones using the recursions. For 1 ≤ i ≤ j ≤ i+3 ≤ n, u(i,j) = 1,
up(i,j) = 0, u1(i,j) = 0, s1(i,j) = 0, s2(i,j) = 0, and s3(i,j) = 0; for j = i+4 ≤ n,
u(i,i+4) = 1 + exp[-(eh(3)+etp(i,i+4))/RT], up(i,i+4) = exp[-eh(3)/RT],
u1(i,i+4) = exp[-(c+eh(3)+etp(i,i+4))/RT], s1(i,i+4) = 0, s2(i,i+4) = 0, and
s3(i,i+4) = exp{-[eh(3)+etp(i,i+4)+ed5(i,i+4,i-1)]/RT}; for 1 ≤ i ≤ n,
u(i+1,i) = 1, u1(i+1,i) = 0; and for 1 ≤ i ≤ n-1, u1(i+2,i) = 0.
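The bottom-up order of computation and the role of an auxiliary array can be illustrated on a much simpler energy model. The sketch below is not the recursion of equations (4) to (10): it assumes a toy model in which every allowed base pair contributes the same free energy (the hypothetical parameter `pair_energy`) and loops, stacking and dangling ends contribute nothing. The array `s` plays a role analogous to s1(h,j), so that each u(i,j) is obtained from a single sum over h.

```python
import math

INV_RT = 1.6225  # 1/RT in mol/kcal at 37 C, as quoted in the text
MIN_LOOP = 3     # at least three unpaired bases between paired bases
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def partition_functions(seq, pair_energy=-1.0):
    """Fill u, up and s for a toy pair-only energy model (0-based indices).

    u[i][j]  : all structures on seq[i..j]
    up[i][j] : structures on seq[i..j] in which i and j are paired
    s[h][j]  : sum over l of up[h][l] * u(l+1, j), shared by every i <= h
    """
    n = len(seq)
    w = math.exp(-pair_energy * INV_RT)  # Boltzmann weight of one base pair
    u = [[1.0] * n for _ in range(n)]    # boundary value 1 for short fragments
    up = [[0.0] * n for _ in range(n)]
    s = [[0.0] * n for _ in range(n)]

    def u_at(i, j):
        # Empty fragments (i > j) contribute the boundary value 1.
        return u[i][j] if i <= j else 1.0

    for j in range(n):                   # proceed to longer fragments
        for i in range(j, -1, -1):
            if j - i > MIN_LOOP and (seq[i], seq[j]) in PAIRS:
                up[i][j] = w * u_at(i + 1, j - 1)
            s[i][j] = sum(up[i][l] * u_at(l + 1, j)
                          for l in range(i + MIN_LOOP + 1, j + 1))
            # 1 for the single-stranded fragment, plus one term per position h
            # of the base pair closest to the 5' end.
            u[i][j] = 1.0 + sum(s[h][j] for h in range(i, j + 1))
    return u, up, s
```

With `pair_energy=0.0` every structure has weight 1, so `u[0][n-1]` simply counts the admissible structures; for `"GGAAACC"` this gives 6. Storing `s` trades memory for time in the same way as the auxiliary arrays above: each entry is computed once and reused for all fragments that share the same 3' end.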

The algorithm accommodates the recent free energy rules and parameters
with the exception of coaxial stacking. In particular, free energies for dangling
ends are incorporated analytically and rigorously. The parameters include free energies for
stacking in a helix, stacking for a terminal mismatch in a hairpin loop (size ≥ 4) or
an interior loop, and penalties for hairpin, bulge, interior and multi-branched loops. Free
energies for dangling ends are used for exterior and multi-branched loops. For
hairpins, a bonus for UU and GA first mismatches (included in the terminal stacking
data) and a bonus for G-U closure preceded by two G nucleotides in base pairs are
applied, and a penalty for oligo-C loops (all unpaired nucleotides are C) is used. A
table may be consulted for tetraloops (hairpin loops with four unpaired nucleotides).
For a bulge of one nucleotide, the stacking energy of the adjacent pairs may be
added. For interior loops, tables for 1×1, 1×2, and 2×2 loops may be consulted and a
penalty for asymmetry may be applied. A terminal A-U, G-U penalty may be
explicitly applied to an exterior loop, multi-branched loops, bulges longer than one
nucleotide, and triloops (hairpin loops with three unpaired nucleotides), while this
penalty may be included in the terminal stacking data for hairpin loops (size ≥ 4)
and interior loops. These free energy parameters are for 37°C and 1 M NaCl;
however, this algorithm can be used with any set of nearest neighbor parameters
derived for other conditions.

With the partition function u(1,n) available, the Boltzmann equilibrium
probability for a secondary structure $I_{1n}$ of sequence $R_{1n}$ can then be computed.
From a Bayesian statistics perspective, both the sequence $R_{1n}$ and the secondary
structure $I_{1n}$ are random variables. Thus the Boltzmann probability in (1) can be
rewritten as a conditional probability of the secondary structure given the sequence
data:

$$P(I_{1n} \mid R_{1n}) = \exp[-E(R_{1n}, I_{1n})/RT]\,/\,u(1,n) \tag{11}$$

Sampling Structures from the Boltzmann Distribution

Instead of presenting a minimum free energy structure, it has been shown by
Ding that a statistical sample of the probable structures can be generated for a
stacking-energy-based model. This task can also be accomplished for a more
comprehensive energy model by realizing that the recursions for partition functions
correspond to sampling probabilities. For a fragment $R_{ij}$ for which it is unknown if
the ends form a pair, the conditional probabilities corresponding to the five cases
considered for equation (4) are given by the first five of the following six equations,
respectively:

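$$P_0 = 1/u(i,j) \tag{12}$$

$$P_{ij} = \mathit{up}(i,j)\exp[-\mathit{etp}(i,j)/RT]\,/\,u(i,j) \tag{13}$$

$$P_{hj} = \mathit{up}(h,j)\exp\{-[\mathit{ed5}(h,j,h-1)+\mathit{etp}(h,j)]/RT\}\,/\,u(i,j), \quad i < h < j \tag{14}$$

$$P_{il} = \mathit{up}(i,l)\exp[-\mathit{etp}(i,l)/RT]\,\{\exp[-\mathit{ed3}(i,l,l+1)/RT]\,u(l+2,j) + u(l+1,j) - u(l+2,j)\}\,/\,u(i,j), \quad i < l < j \tag{15}$$

$$P_{s1h} = s1(h,j)\,/\,u(i,j), \quad i < h < j-1 \tag{16}$$

$$P_{hl} = \mathit{up}(h,l)\exp\{-[\mathit{ed5}(h,l,h-1)+\mathit{etp}(h,l)]/RT\}\,\{\exp[-\mathit{ed3}(h,l,l+1)/RT]\,u(l+2,j) + u(l+1,j) - u(l+2,j)\}\,/\,s1(h,j), \quad h < l < j \tag{17}$$

where $P_0 + P_{ij} + \sum_{i < h < j} P_{hj} + \sum_{i < l < j} P_{il} + \sum_{i < h < j-1} P_{s1h} = 1$, and $\sum_{h < l < j} P_{hl} = 1$. $\{P_{hl}\}$ are for
sampling l after h is sampled for case (e). The computation is linear by using s1(h,j)
through $\{P_{s1h}\}$ and $\{P_{hl}\}$. When the ends are known to form a base pair, the
probabilities for the four cases considered for (5) are given by the first four of the
following five equations, respectively: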

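$$Q_{ijH} = \exp[-\mathit{eh}(i,j)/RT]\,/\,\mathit{up}(i,j) \tag{18}$$

$$Q_{ijS} = \exp[-\mathit{es}(i,j,i+1,j-1)/RT]\,\mathit{up}(i+1,j-1)\,/\,\mathit{up}(i,j) \tag{19}$$

$$Q_{ijBI} = \Big\{\sum_{i < h < l < j} \exp[-\mathit{ebi}(i,j,h,l)/RT]\,\mathit{up}(h,l)\Big\}\,/\,\mathit{up}(i,j) \tag{20}$$

$$Q_{ijM} = \mathit{up}_m(i,j)\,/\,\mathit{up}(i,j) \tag{21}$$

$$Q_{hl|BI} = \exp[-\mathit{ebi}(i,j,h,l)/RT]\,\mathit{up}(h,l)\,\Big/ \sum_{i < h' < l' < j} \exp[-\mathit{ebi}(i,j,h',l')/RT]\,\mathit{up}(h',l'), \quad i < h < l < j \tag{22}$$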

where $Q_{ijH} + Q_{ijS} + Q_{ijBI} + Q_{ijM} = 1$, and $\sum_{i < h < l < j} Q_{hl|BI} = 1$. For sampling h and l after
the case of bulge or interior loop is sampled, $\{Q_{hl|BI}\}$ may need to be computed.
$\mathit{up}_m(i,j)$ is the contribution to up(i,j) by the case of multi-branched loop.
In the case of a multi-branched loop, the probabilities for sampling the
closing base pair $r_{h1}$-$r_{l1}$ of the first 5' end internal helix in the loop correspond to the
terms in (6) for $\mathit{up}_m(i,j)$ with the quartic term expressed in terms of s2(h,j). More
specifically, we first sample h and l according to the following conditional probabilities:
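$$
P_{ij(i+1)l} = \mathit{up}(i+1,l)\exp\{-[a+2c+\mathit{etp}(i+1,l)]/RT\} \times \{\exp[-\mathit{ed3}(i+1,l,l+1)/RT]\,u1(l+2,j-1) + u1(l+1,j-1) - u1(l+2,j-1)\}\,/\,\mathit{up}_m(i,j), \quad i+1 < l < j
\tag{23}
$$

$$
P_{ij(i+2)l} = \mathit{up}(i+2,l)\exp\{-[a+2c+b+\mathit{ed3}(j,i,i+1)+\mathit{etp}(i+2,l)]/RT\} \times \{\exp[-\mathit{ed3}(i+2,l,l+1)/RT]\,u1(l+2,j-1) + u1(l+1,j-1) - u1(l+2,j-1)\}\,/\,\mathit{up}_m(i,j), \quad i+2 < l < j
\tag{24}
$$

$$
P_{ij,s2h} = \exp\{-[a+2c+(h-i-1)b+\mathit{ed3}(j,i,i+1)]/RT\}\,s2(h,j)\,/\,\mathit{up}_m(i,j), \quad i+3 \le h < j-1
\tag{25}
$$

$$
P_{ijhl} = \mathit{up}(h,l)\exp\{-[\mathit{ed5}(h,l,h-1)+\mathit{etp}(h,l)]/RT\} \times \{\exp[-\mathit{ed3}(h,l,l+1)/RT]\,u1(l+2,j-1) + u1(l+1,j-1) - u1(l+2,j-1)\}\,/\,s2(h,j), \quad h < l < j
\tag{26}
$$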

where $\sum_{i+1 < l < j} P_{ij(i+1)l} + \sum_{i+2 < l < j} P_{ij(i+2)l} + \sum_{i+3 \le h < j-1} P_{ij,s2h} = 1$, and $\sum_{h < l < j} P_{ijhl} = 1$.
$\{P_{ij(i+1)l}\}$ are for cases when h = i+1, $\{P_{ij(i+2)l}\}$ are for cases when h = i+2, and
$\{P_{ijhl}\}$ are for sampling l after h ≥ i+3 is sampled from $\{P_{ij,s2h}\}$. Once both h and l
are sampled, the closing base pair $r_{h1}$-$r_{l1}$ of the first helix is given by setting h1 = h
and l1 = l.

For sampling the second internal helix, the sampling probabilities for base
pair $r_{h2}$-$r_{l2}$ of the helix closest to the 5' end of $R_{(l1+1)(j-1)}$ correspond to terms in (7)
for u1(l1+1,j-1) (i is substituted by l1+1, and j is substituted by j-1) with the quartic
term expressed in terms of s3(h,j-1). More specifically, we first sample h and l
according to the conditional probabilities:
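$$
Q_{(l1+1)(j-1)(l1+1)l} = \mathit{up}(l1+1,l)\exp\{-[c+\mathit{etp}(l1+1,l)]/RT\}\,\{f(j,l1+1,l)\exp[-(j-1-l)b/RT] + \exp[-\mathit{ed3}(l1+1,l,l+1)/RT]\,u1(l+2,j-1) + u1(l+1,j-1) - u1(l+2,j-1)\}\,/\,u1(l1+1,j-1), \quad l1+1 < l \le j-1
\tag{27}
$$

$$
Q_{(l1+1)(j-1)(l1+2)l} = \mathit{up}(l1+2,l)\exp\{-[c+b+\mathit{etp}(l1+2,l)]/RT\}\,\{f(j,l1+2,l)\exp[-(j-1-l)b/RT] + \exp[-\mathit{ed3}(l1+2,l,l+1)/RT]\,u1(l+2,j-1) + u1(l+1,j-1) - u1(l+2,j-1)\}\,/\,u1(l1+1,j-1), \quad l1+2 < l \le j-1
\tag{28}
$$

$$
Q_{(l1+1)(j-1),s3h} = \exp\{-[c+(h-l1-1)b]/RT\}\,s3(h,j-1)\,/\,u1(l1+1,j-1), \quad l1+3 \le h < j-1
\tag{29}
$$

$$
Q_{(j-1)hl} = \mathit{up}(h,l)\exp\{-[\mathit{ed5}(h,l,h-1)+\mathit{etp}(h,l)]/RT\}\,\{f(j,h,l)\exp[-(j-1-l)b/RT] + \exp[-\mathit{ed3}(h,l,l+1)/RT]\,u1(l+2,j-1) + u1(l+1,j-1) - u1(l+2,j-1)\}\,/\,s3(h,j-1), \quad h < l \le j-1
\tag{30}
$$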
where $\sum_{l1+1 < l \le j-1} Q_{(l1+1)(j-1)(l1+1)l} + \sum_{l1+2 < l \le j-1} Q_{(l1+1)(j-1)(l1+2)l} + \sum_{l1+3 \le h < j-1} Q_{(l1+1)(j-1),s3h} = 1$,
and $\sum_{h < l \le j-1} Q_{(j-1)hl} = 1$. $\{Q_{(l1+1)(j-1)(l1+1)l}\}$ are for cases when h = l1+1,
$\{Q_{(l1+1)(j-1)(l1+2)l}\}$ are for cases when h = l1+2, and $\{Q_{(j-1)hl}\}$ are for sampling l after
h ≥ l1+3 is sampled from $\{Q_{(l1+1)(j-1),s3h}\}$. Once both h and l are sampled, the closing
base pair $r_{h2}$-$r_{l2}$ of the second internal helix is given by setting h2 = h and l2 = l. Next,
one must consider two possibilities: either there is no more helix in the loop or there
is at least one more helix. These two mutually exclusive cases are addressed by two
additive terms in (7) for u1(l1+1,j-1). These terms give the binomial probability,
conditional on sampled h2 and l2, for no more helix between $r_{l2+1}$ and $r_{j-1}$:

$$
P_{B,h2\,l2\,(j-1)} = \frac{f(j,h2,l2)\exp[-(j-1-l2)b/RT]}{f(j,h2,l2)\exp[-(j-1-l2)b/RT] + \exp[-\mathit{ed3}(h2,l2,l2+1)/RT]\,u1(l2+2,j-1) + u1(l2+1,j-1) - u1(l2+2,j-1)}
\tag{31}
$$

and the probability of at least one more helix is $1 - P_{B,h2\,l2\,(j-1)}$. If no more helix is
sampled, sampling is terminated for this multi-branched loop; otherwise, the closing
base pair of the next internal helix is sampled, followed by another binomial
sampling. This process stops whenever no more helix is sampled. At the end of this
process, for L sampled internal helices with closing base pairs $r_{hk}$-$r_{lk}$, 1 ≤ k ≤ L,
sampling probabilities are computed with (7) for u1(l(k-1)+1,j-1) (2 ≤ k ≤ L). This
computation and the computation for binomial sampling with probability $P_{B,hk\,lk\,(j-1)}$
are performed (L-1) times on overlapping fragments with decreasing length. $P_{B,hk\,lk\,(j-1)}$
is given by (31) with h2 and l2 substituted by hk and lk, respectively. Similar to the
probability computation for (4), the computation of the sampling probabilities with
(6) and (7) is linear by using s2(h,j) and s3(h,j). When long interior loops are
disregarded, the probability computation for (5) is bounded by a constant.

Fig. 5 is a flow diagram illustrating steps of a sampling algorithm in
accordance with an embodiment of the present invention. As shown in Fig. 5, two
stacks A and B are used by the sampling algorithm. In accordance with an
embodiment of the invention, stacks A and B may be data stored in data storage
device 120, as illustrated in Fig. 1. Stack A stores fragments {(i,j,I)} for sampling,
where for the fragment from the ith base to the jth base, I = 1 if it is known the ends
form a pair and I = 0 if this pair is unknown. Stack B collects base pairs and unpaired
bases that will define a sampled secondary structure upon the completion of
sampling. At the start, (1,n,0) is the only fragment in stack A. Specifically, a
structure is drawn recursively as follows:

(1) Starting with $R_{1n}$, draw single stranded $R_{1n}$ or a base pair according to
probabilities $P_0$, $P_{ij}$, $\{P_{hj}\}$, $\{P_{il}\}$ and $\{P_{s1h}\}$ for i = 1, j = n; if h is
sampled for case (e) in the derivation for equation (4), then l is
sampled with $\{P_{hl}\}$. In case (a), i.e., single stranded $R_{1n}$, the
sampling is completed; in case (b), (1,n,1) is stored in stack A; in
case (c), (h,n,1) is stored in A and the unpaired bases from the first
base to the (h-1)th base are stored in stack B; in case (d), (1,l,1) and
(l+1,n,0) are stored in stack A; in case (e), (h,l,1) and (l+1,n,0) are
stored in stack A and the unpaired bases from the first base to the
(h-1)th base are stored in stack B.
(2) For a new fragment $R_{ij}$ from stack A, if base pair $r_i$-$r_j$ was not
sampled previously, sampling for $R_{ij}$ may be performed by the same
process for $R_{1n}$, with 1 and n substituted by i and j, respectively.
(3) For a new fragment $R_{ij}$ from stack A with ends paired (I = 1), the loop type
may be sampled first with probabilities $Q_{ijH}$, $Q_{ijS}$, $Q_{ijBI}$, and $Q_{ijM}$;
this step is followed by
(3a) For a hairpin loop, the unpaired bases in the loop and the
closing pair are stored in stack B as part of a sampled
structure and they are no longer involved in further sampling.
(3b) For stacking, the exterior base pair (i-j) is stored in stack B
and the interior base pair defines a new fragment (i+1,j-1,1)
to be stored in stack A.
(3c) For a bulge or interior loop, the interior base pair in the loop
(h-l) is sampled with probabilities $\{Q_{hl|BI}\}$. The exterior base pair
(i-j) and unpaired bases in the loop are stored in stack B and the
interior base pair defines a new fragment (h,l,1) to be stored in stack A.
(3d) For a multi-branched loop, an interior base pair closest to the 5'
end of $R_{ij}$ is sampled first, and a second interior base pair is then
sampled. Next, one of the two cases by the binomial
distribution may need to be sampled: no more helix on the 3'
side of the loop or at least one more helix. In the latter case,
another interior base pair is sampled for one more helix. For
the remaining fragment on the 3' side of the loop, the
binomial and interior base pair sampling is repeated until no
more helix is sampled. Unpaired bases in the loop and $r_i$-$r_j$
are stored in stack B, and new fragments defined by the
interior base pairs are stored in stack A for further sampling.

During this process, after the completion of sampling for a fragment from stack A
and storage of new fragment(s) in stack A and/or storage of base pairs and unpaired
bases in stack B, the fragment at the bottom of stack A is selected for subsequent
sampling. The process terminates when stack A is empty, and a sampled secondary
structure is formed by the base pairs and unpaired bases in stack B (Fig. 5).
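The two-stack procedure can be sketched for a simplified energy model. The code below is an illustration under stated assumptions, not the sampling probabilities (12) to (31): it assumes a toy model in which every allowed base pair has the same free energy (the hypothetical parameter `pair_energy`) and loop, stacking and dangling-end terms are ignored, so a fragment with paired ends simply releases its interior as a new unconstrained fragment. The names `fill` and `sample_structure` are illustrative.

```python
import math
import random

INV_RT = 1.6225  # 1/RT in mol/kcal at 37 C, as quoted in the text
MIN_LOOP = 3
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def fill(seq, pair_energy=-1.0):
    """Toy forward step: u[i][j] and up[i][j] for a pair-only energy model."""
    n = len(seq)
    w = math.exp(-pair_energy * INV_RT)
    u = [[1.0] * n for _ in range(n)]
    up = [[0.0] * n for _ in range(n)]

    def get(i, j):
        return u[i][j] if i <= j else 1.0  # empty fragment has weight 1

    for j in range(n):
        for i in range(j, -1, -1):
            if j - i > MIN_LOOP and (seq[i], seq[j]) in PAIRS:
                up[i][j] = w * get(i + 1, j - 1)
            paired = sum(up[i][l] * get(l + 1, j)
                         for l in range(i + MIN_LOOP + 1, j + 1))
            u[i][j] = get(i + 1, j) + paired  # base i unpaired, or paired with l
    return u, up


def sample_structure(seq, u, up, rng):
    """Draw one structure; returns a sorted list of 0-based base pairs."""
    def get(i, j):
        return u[i][j] if i <= j else 1.0

    stack_a = [(0, len(seq) - 1, 0)]  # fragments (i, j, ends_paired) to sample
    stack_b = []                      # collected base pairs
    while stack_a:
        i, j, ends_paired = stack_a.pop()  # processing order does not matter here
        if ends_paired:
            stack_b.append((i, j))
            stack_a.append((i + 1, j - 1, 0))
            continue
        # Walk through the terms of u(i, j): first the single-stranded
        # fragment (weight 1), then every 5'-most base pair (h, l).
        remaining = rng.random() * get(i, j) - 1.0
        choice = None
        if remaining >= 0.0:
            for h in range(i, j + 1):
                for l in range(h + MIN_LOOP + 1, j + 1):
                    remaining -= up[h][l] * get(l + 1, j)
                    if remaining < 0.0:
                        choice = (h, l)
                        break
                if choice is not None:
                    break
        if choice is not None:  # otherwise the fragment stays single stranded
            h, l = choice
            stack_a.append((h, l, 1))
            stack_a.append((l + 1, j, 0))
    return sorted(stack_b)
```

For example, `u, up = fill("GAAAC")` followed by repeated calls to `sample_structure("GAAAC", u, up, random.Random(0))` returns either the open chain `[]` or the single hairpin `[(0, 4)]`, the latter with a long-run frequency close to w/(1+w), where w is the Boltzmann weight of the pair. Because each draw only reads the precomputed arrays, many structures can be sampled after a single forward pass.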

The algorithm samples a structure exactly and rigorously from the
Boltzmann equilibrium probability distribution (1) or equivalently (11), because the
sampling probabilities are computed by Boltzmann conditional distribution based on
partition functions restricted to a fragment with or without a base pair constraint. This
is obvious for the unfolded state with a free energy of 0, whose sampling probability
of 1/u(1,n) is also its Boltzmann probability by (1) or (11).

From a statistical mechanics perspective, there is an ensemble of probable
structures and thus structure I can be viewed as a random variable. I can be
expressed by an upper triangular matrix of random and dependent indicator variables
$I_{ij}$, 1 ≤ i < j ≤ n. $I_{ij}$ = 1 if the ith base is paired with the jth base, and $I_{ij}$ = 0 otherwise. The
requirement of at least three unpaired intervening bases between a base pair implies
$I_{ij}$ = 0 for j = i+1, i+2 and i+3, 1 ≤ i, i+3 ≤ n. The assumption of no pseudoknots
implies $I_{ij} I_{i'j'} = 0$ for i < i' < j < j'. Also, when base triples are prohibited, $\sum_j I_{ij} \le 1$ and
$\sum_i I_{ij} \le 1$. Thus, I is a high-dimensional random variable. Sampling directly from a
high-dimensional probability distribution is often difficult. In some cases, however,
the difficulty can be overcome by conditional sampling at lower dimension(s). More
specifically, given data y, if one can sample from conditional distributions $p(x_1 \mid y)$,
$p(x_k \mid x_1, \ldots, x_{k-1}, y)$ (k = 2, ..., m), then
$x = (x_1, x_2, \ldots, x_m)$ follows distribution $p(x \mid y)$. This is the scheme adopted for secondary
structure sampling. For given RNA sequence data, the new base pairs and single
stranded bases are sampled by conditioning on already formed substructures from
previous sampling steps. Upon the completion of the process, the collection of the
substructures defines a structure sampled according to the Boltzmann equilibrium
probability distribution (1) or equivalently (11).
The sampling process is similar to the traceback algorithm employed in the
dynamic programming algorithms but differs in that the base pairing is randomly
sampled with
Boltzmann conditional probabilities rather than selected by the minimum energy
principle for the fragments. Because the probability of a structure decreases
exponentially with increasing free energy, the most likely structure in a sample is the
minimum free energy structure. In other words, the minimum free energy structure
has the largest sampling probability because its Boltzmann probability is larger than
that of any other structure.

For the Leptomonas collosoma spliced leader RNA (L. collosoma SL RNA)
of 56 nt, two experimental secondary structures 1 and 2 have been elucidated.
Neither is the minimum free energy (MFE) structure computed by the mfold server
(http://www.bioinfo.rpi.edu/applications/mfold/). Based on structures generated by
our sampling algorithm, sampling estimates for the MFE structure and the two
experimental structures are computed (Table 1 in Fig. 6). The MFE structure has the
largest observed frequency among all sampled structures. Furthermore, the
Boltzmann equilibrium probability of a structure (equation (1) or (11)) is closely
estimated by its maximum likelihood estimate (MLE) computed from the sample
and is contained in the 95% confidence interval (CI). This gives an illustrative
example of the theoretical assertion that the algorithm samples secondary structures
by their Boltzmann equilibrium probabilities.
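A sampling estimate of this kind can be sketched as follows. The text does not state how the 95% confidence interval was constructed; the snippet assumes the usual normal approximation to the binomial distribution, and the function name `frequency_estimate` is hypothetical.

```python
import math


def frequency_estimate(count, sample_size, z=1.96):
    """MLE of a structure's probability and an approximate 95% interval.

    count       : number of sampled structures identical to the one of interest
    sample_size : total number of sampled structures
    z           : normal quantile (1.96 for a two-sided 95% interval)
    """
    p_hat = count / sample_size  # maximum likelihood estimate
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / sample_size)
    return p_hat, max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)
```

The estimate can then be compared with the exact Boltzmann probability exp[-E/RT]/u(1,n) of the same structure; agreement within the interval is what the comparison in Table 1 checks. The normal approximation becomes unreliable for very rare structures, where an exact binomial interval would be a safer choice.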

Because there are no more than (n-3)/2 base pairs in a secondary structure and the
time for sampling a pair is at most O(n) when long interior loops are disallowed, the
time of the sampling algorithm is bounded by $O_p(n^2)$, i.e., stochastically quadratic in
the worst case. Thus, once the forward recursions for the partition functions are
completed in cubic time, a sample of structures can be quickly generated. This is
illustrated by Table 2 in Fig. 7 for ten biological sequences having a wide range of
lengths (the time for calculating partition functions can be perfectly fitted by a cubic
curve; a figure is not shown here).

Class Representation of Boltzmann Ensemble of Secondary Structures
Classification of sampled structures. For the Leptomonas collosoma spliced leader
RNA (L. collosoma SL RNA), two competing secondary structural forms 1 and 2
have been indicated by ribonuclease data, although the role of the structures has yet to
be identified. 1,000 structures sampled by our algorithm for this sequence of 56
bases were examined. It was found that the structures fall into two classes 1 and 2,
corresponding to the two experimental structural forms 1 and 2. Class 1 can be
further subdivided into classes 1A, 1B, and 1C; each of these subclasses has a yet
higher level of structural similarity among its members. Class 2 can be further
broken down into classes 2A and 2B. A group of structures can be displayed by
means of a two-dimensional histogram (2Dhist). Distinct patterns in this
representation are indicative of common structural features for the group, whereas
scattering of the squares would indicate its structural diversity. As illustrated by
Figs. 8A-C, structures in classes 1A, 1B, and 1C have in common two helices,
represented by the two clusters of 5 squares and 4 squares, respectively. Specifically,
the first helix is formed by the base pairs U:A, G:C, A:U, A:U, and
G:C. The second helix is formed by A:U, C:G, A:U, and G:C. On
the other hand, the histograms also show that members of these classes have
different structural features. Structures in classes 2A and 2B also have in common
two helices (Figs. 9A-B), which are different from the two common helices for
classes 1A, 1B, and 1C. The major difference between class 2A and class 2B is the
existence of an additional helix for class 2B. This helix is represented by a cluster of
squares in the bottom left portion of the histogram in Fig. 9B.
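A two-dimensional histogram of this kind can be tabulated directly from a group of sampled structures. The sketch below assumes that each structure is supplied in dot-bracket notation (an assumption; the text does not specify a storage format), and the function names are illustrative.

```python
from collections import Counter


def base_pairs(dot_bracket):
    """Return the base pairs (1-based positions) of a dot-bracket string."""
    open_positions, pairs = [], []
    for position, symbol in enumerate(dot_bracket, start=1):
        if symbol == "(":
            open_positions.append(position)
        elif symbol == ")":
            pairs.append((open_positions.pop(), position))
    return pairs


def pair_histogram(structures):
    """Map each base pair (i, j) to its frequency within the group."""
    counts = Counter()
    for structure in structures:
        counts.update(base_pairs(structure))
    return {pair: count / len(structures) for pair, count in counts.items()}
```

Pairs with a frequency near 1 correspond to the tight clusters of squares shared by a class, while many low-frequency pairs correspond to scattered squares and hence to structural diversity within the group.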

Probability of a class and the Boltzmann probability of its representative.
For a class of similar structures, the structure occurring with the highest frequency
(i.e., the most probable structure) in the sample is taken as the representative of the
class. Class 1A is represented by experimental structural form 1 (Fig. 10A). The
minimum free energy (MFE) structure from mfold shown by Fig. 10B is the
representative of class 1B. Class 1C is represented by the structure in Fig. 10C that
is the MFE structure with a short helix removed. Experimental structural form 2 (Fig.
11A) is the representative for class 2A. The representative for class 2B shown by
Fig. 11B is experimental structural form 2 with an additional hairpin-helix stem on
its long single-stranded 5' end. The probability of a class is computed by its
frequency in the sample; the Boltzmann equilibrium probability of the representing
structure is computed by using its free energy and the partition function available
from the forward step of the algorithm (equation (1)). The size of a class is reflected
by the class probability. It is a surprising observation that the Boltzmann probability
of the representative structure is not necessarily reflective of the magnitude of the
class probability (Table 3 in Fig. 12, Fig. 13). For example, the probability for class
1C is about 13.4% larger than that for class 1B; however, the Boltzmann probability
of class 1C's representative is merely 37.8% that of the representative structure for
class 1B.

"Entropic class". For class 2B, the ratio of the class probability and the
Boltzmann probability of its most probable member is 290.70, which is strikingly
high. Despite the very small Boltzmann probability for its most probable member,
this group contains a substantial number of similar structures such that the collection
of these structures has a much higher aggregate probability. Such an "entropic class" of
structures can be revealed by sampling through classification. However, a structure
in an entropic class can be easily overlooked when it is examined individually on the
basis of its free energy or Boltzmann probability.
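The class-level quantities discussed above can be tabulated from a labelled sample. The sketch below assumes that each sampled structure has already been assigned a class label by some classification procedure (not shown), and it uses the representative's sample frequency as a stand-in for its Boltzmann probability, which the text instead computes from the free energy and the partition function. All names are illustrative.

```python
from collections import Counter


def class_summary(labelled_sample):
    """Summarize a sample given as a list of (class_label, structure) tuples.

    Returns {label: (class_probability, representative, representative_frequency)}.
    """
    total = len(labelled_sample)
    per_class = {}
    for label, structure in labelled_sample:
        per_class.setdefault(label, Counter())[structure] += 1
    summary = {}
    for label, counts in per_class.items():
        representative, rep_count = counts.most_common(1)[0]
        class_probability = sum(counts.values()) / total
        summary[label] = (class_probability, representative, rep_count / total)
    return summary


def entropic_ratio(class_probability, representative_probability):
    """Ratio of the class probability to that of its most probable member."""
    return class_probability / representative_probability
```

A large ratio flags a class whose weight comes from many similar low-probability structures rather than from a single dominant one, which is the situation described for class 2B.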

Table 3 in Fig. 12 presents a summary of the above analyses. Although the
two experimental structures are 25.2% and 15.9% off the minimum free energy,
respectively, they are both predicted by the sample. Version 3.1 of mfold (2) was run
on the mfold server (http://www.bioinfo.rpi.edu/applications/mfold) to fold this
sequence. For suboptimality percentage P under 15 (default = 5), only the optimal
folding is returned. Only for a large P, e.g., P = 30, are the two alternative structures
returned as suboptimal foldings. This example underscores the importance of
examining suboptimal structures. It also shows that important alternative structures
and structural motifs can be revealed by a statistical sample of the Boltzmann
ensemble. These findings suggest that the Boltzmann ensemble of RNA secondary
structures can be more adequately represented by classes (through 2Dhist) taken
with their probabilities, together with the class-representative structures and their
Boltzmann probabilities. Thus, through structure classification, the sampling
approach can achieve the objective of both efficient and statistically unbiased
representation of suboptimal foldings.
Prediction of Alternative Structures 

The analysis of L. collosoma SL RNA suggests that alternative biological structures
can be adequately revealed by a statistical sample. This can be investigated further
by applying the sampling algorithm to mRNA secondary structure prediction.
mRNA secondary structures can play a regulatory role in determining the rates of
translation initiation. This is explained by a model of coexisting alternative
structures: one structure favors translation initiation while the other inhibits
it. Also, it has been argued that the accessibility of the initiation
codon is important for maximizing expression. The secondary structure of an mRNA
is generally unavailable by experimental means, because complete structural probing
by chemical or enzymatic methods is very difficult for long-chain RNAs. A rare
exception is the short mRNA for the cIII gene of bacteriophage λ, for which two
conformations A and B (Fig. 14A and Fig. 14B) were elucidated and were
demonstrated to coexist in equilibrium. The sequence of 132 nucleotides in the
structures covers 46 nucleotides of the coding region and 86 nucleotides upstream
from the initiation codon AUG. In structure A, the initiation codon and part of the
Shine-Dalgarno sequence UAAGGAG are in a closed, base-paired
conformation such that the ribosome binding site is occluded. In structure B, the
ribosome binding site is open for interactions. It is speculated that cIII gene
expression is regulated at the translation initiation level by the ratio of the two
structures at equilibrium, and that changes in temperature or Mg2+ concentration, and
perhaps ribosome binding, can shift the equilibrium.

For cIII mRNA, a sample of 100 structures was generated by the algorithm
and was manually examined. In this sample, 89 are close variants of structure A. The
left stem in structure A is precisely predicted in 67 of the 89 structures. The exact
right stem and a modification with one or both of two additional A:U pairs
are predicted in 72 of the 89 structures. Appreciable variability in the
location of interior and bulge loops is observed for the middle stem. Structure C in
Fig. 14C is one of three structures in the sample which closely resemble structure B.
The appreciable modification is the additional short helix involving the
Shine-Dalgarno sequence formed by two G:C base pairs. The remaining eight
structures (structures not shown) in the sample do not resemble either structure A or
B and have diverse structural features. The optimal folding by mfold is a
modification of structure A with three additional base pairs (one C:G and
two A:U), with the MFE of ΔG°37 = -48.5 kcal/mole. Structure A is well predicted
by the optimal folding. Its free energy is ΔG°37 = -46.1 kcal/mole, 5% off the MFE.
Structure B has ΔG°37 = -40.2 kcal/mole, 17% off the MFE. Structure C has ΔG°37
= -42.9 kcal/mole, 12% off the MFE. For P = 30, neither B nor a variant resembling B
as closely as C is predicted by suboptimal foldings from mfold, although both
structures B and C are well within this range of suboptimality. By using the option
for specifying base pair constraints in mfold, we verified that structures B and C are
indeed in the "missing" set of suboptimal foldings that are excluded by the
algorithm design for mfold. In contrast to the stability indicated by the free energies,
experimental analysis showed that structure B is favored by a factor of about 3. The
discrepancy could be explained by tertiary interactions which preferentially stabilize
structure B. This application not only presents an example in which an important
alternative structure can be better predicted by a sample of moderate size, but also
shows that alternative structures of low probability can be biologically important
because stability contributions from potential tertiary interactions are unaccounted
for. The finding also suggests the sampling algorithm can be well suited to the
prediction of secondary structure of mRNAs, because an mRNA may have many
conformations in an intracellular environment.
Assignment of Probabilities for Structural Motifs

In many applications, certain structural motifs are of biological interest. 
Sampling also enables probabilistic prediction of any motif with or without specific 
constraint(s). The probability of a motif can be directly estimated by the frequency 
of its occurrence in a sample. This is shown in Fig. 15 for several constrained
motifs involving the AUG initiation codon or the Shine-Dalgarno sequence of cIII
mRNA, and for a helix, a base pair and a single-stranded region of two bases. The
algorithm by McCaskill is limited to the probability calculation for an individual base
pair or unpaired base. Probabilities of larger motifs such as helices of two or more
base pairs and single-stranded regions of two or more unpaired bases are not
available from this algorithm. In contrast, the sampling algorithm is readily 
applicable for this purpose. 
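
The frequency estimate described above can be illustrated with a minimal sketch. It assumes that sampled structures are available as dot-bracket strings, which is a representation chosen here for illustration and not one specified in this text. The function names and the four-structure toy sample are hypothetical.

```python
def base_pairs(dot_bracket):
    """Return the set of 1-based (i, j) base pairs encoded by a dot-bracket string."""
    stack, pairs = [], set()
    for pos, ch in enumerate(dot_bracket, start=1):
        if ch == "(":
            stack.append(pos)
        elif ch == ")":
            pairs.add((stack.pop(), pos))
    return pairs

def motif_probability(sample, has_motif):
    """Estimate a motif probability as its frequency of occurrence in the sample."""
    return sum(1 for s in sample if has_motif(s)) / len(sample)

def helix(i, j, length):
    """Motif test: `length` stacked pairs (i, j), (i+1, j-1), ... are all present."""
    def test(structure):
        pairs = base_pairs(structure)
        return all((i + k, j - k) in pairs for k in range(length))
    return test

def unpaired_region(start, width):
    """Motif test: the `width` bases beginning at 1-based `start` are all unpaired."""
    def test(structure):
        return all(structure[p - 1] == "." for p in range(start, start + width))
    return test

# Hypothetical toy sample; a real sample would hold, e.g., 1,000 structures.
sample = ["((...))..", "((...))..", "(.....)..", "........."]

p_helix = motif_probability(sample, helix(1, 7, 2))        # helix of two base pairs
p_pair = motif_probability(sample, helix(1, 7, 1))         # a single base pair
p_loop = motif_probability(sample, unpaired_region(2, 2))  # two unpaired bases
```

Any predicate on a structure can be passed as `has_motif`, so constrained motifs can be handled by combining tests.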

Boltzmann-Probability-Weighted Density of States and Free Energy Distributions
Cupal et al. presented a recursive algorithm to compute the free energy distribution 
of all secondary structures (i.e., density of states (DOS)). The algorithm is 0(n^) in
time with a memory requirement of 0(«^), and is thus computationally prohibitive
even for sequences of moderate length. For short sequences, this algorithm is useful
for the study of evolution by comparison of DOS between biological sequences and
random sequences of the same composition (Higgs).

The free energy distribution of probable structures for either short or long
sequences is available from our sampling algorithm and is referred to as the
Boltzmann-probability-weighted density of states (BPWDOS) (Figs. 16A, 17A).
Information for the BPWDOS can be displayed in alternative ways for showing the
probability that the free energy of a structure is within a threshold of the global
minimum or is in an energy interval (Figs. 16B, 16C, 17B, 17C). Sampling also
generates representative structures for a given low energy interval. This overcomes
the disadvantage of the algorithm by Cupal et al. that there is no information about
individual structures corresponding to the low energy states. These distributions
could be valuable for evolutionary studies on long sequences and studies on the
RNA energy landscape (Schuster & Stadler).
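
As a hedged illustration of how the BPWDOS might be tabulated from a sample, the sketch below bins the free energies of sampled structures and estimates the probability of lying within a threshold of the minimum. The bin width, the function names, and the five energies are assumptions made for this example only.

```python
import math
from collections import Counter

def bpwdos(energies, bin_width=1.0):
    """Fraction of sampled structures per energy bin, keyed by the bin's lower edge."""
    counts = Counter(math.floor(e / bin_width) * bin_width for e in energies)
    return {edge: c / len(energies) for edge, c in sorted(counts.items())}

def prob_within(energies, threshold, e_min=None):
    """Estimated probability that a structure's free energy is within `threshold`
    (kcal/mol) of `e_min`; the sample minimum is used when `e_min` is not given."""
    if e_min is None:
        e_min = min(energies)
    return sum(1 for e in energies if e - e_min <= threshold) / len(energies)

# Hypothetical free energies (kcal/mol) of five sampled structures.
energies = [-30.0, -29.2, -28.7, -27.1, -25.4]
histogram = bpwdos(energies, bin_width=2.0)
near_minimum = prob_within(energies, threshold=1.5, e_min=-30.0)
```

Passing the MFE from a separate minimization as `e_min` gives the "threshold of the global minimum" view; the structures falling in any one bin are the representative structures for that energy interval.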

The sampling algorithm in accordance with the invention is shown to be an
appealing alternative to existing algorithms for RNA secondary structure prediction.
A sample from the Boltzmann distribution can adequately delineate the Boltzmann
ensemble of secondary structures through classification. This approach avoids the
limitation of suboptimal folding presentation by a designed set and the difficulty
with a complete enumeration of suboptimal foldings. The algorithm is shown to
meet the challenge of predicting alternative structures. The prediction of structural
motifs can be useful in applications. A promising application to antisense target
prediction by the probabilities of single-stranded regions will be described in further
detail below. The sampling approach of the present invention is also a powerful tool
for some important RNA research problems. The capability of predicting alternative
structures suggests sampling can be a promising method for the
prediction of conformational switches, a phenomenon involved in translational
regulation, transcriptional attenuation in prokaryotes, translocation processes, protein
biosynthesis, viral regulation, etc. Because an algorithm according to the present
invention implicitly simulates folding pathways according to statistical mechanics
principles, this approach may allow for adequately characterizing sequential folding
and folding pathways and revealing metastable states into which an RNA can be
trapped during folding. The classes may correspond to different folding pathways.
Sampling may also provide a tool for statistical delineation of the free energy
distribution (i.e., the density of states up to a proportionality constant) of the
Boltzmann ensemble, and a test to determine if this distribution follows a certain
pattern(s) and if it displays two local minima in the case of a conformational switch.
An algorithm may be O(n^3) by disregarding long interior loops. Under an
assumption on interior loop asymmetry, an approach has been developed to reduce
the time of interior loop evaluation from O(n^4) to O(n^3). The sampling stage of the
algorithm may be implemented using, e.g., Fortran 77, on a computing device such
as system 100. It is noted that an algorithm in accordance with the present invention
may be programmed in any computing device or implemented by designing any type
of dedicated hardware for performing the steps thereof. For an RNA sequence of
589 nucleotides, it takes 102 s to complete the partition function calculation, and 87
s to generate 1,000 structures on a 300 MHz CPU of an Ultra 2 Scalable Performance
Architecture ("SPARC") workstation. Manual classification of structures can be
performed for a sample of moderate size. An automated procedure for classifying
a large number of structures may be used to take full advantage of the sampling
approach. While coaxial stacking interactions might not be included in a rigorous
dynamic programming algorithm, a recalculation of the free energies of suboptimal
structures has been proposed to incorporate coaxial stacking for multi-branched
loops. Similarly, a resampling scheme that includes an energy reevaluation step for
sampled structures and a resampling step of these structures based on modified free
energy values may accommodate coaxial stacking.
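
One way such a resampling scheme might be realized is importance resampling, sketched below. This is an assumption about how the two steps could be carried out and is not stated in this text: each sampled structure receives a weight proportional to exp(-(E_new - E_old)/RT), and structures are then redrawn with those weights. The value of RT and all names are illustrative.

```python
import math
import random

RT = 0.616  # kcal/mol near 37 C; an assumed value for this sketch

def resampling_weights(old_energies, new_energies):
    """Normalized weights proportional to exp(-(E_new - E_old) / RT)."""
    raw = [math.exp(-(new - old) / RT)
           for old, new in zip(old_energies, new_energies)]
    total = sum(raw)
    return [w / total for w in raw]

def resample(structures, old_energies, new_energies, size, rng):
    """Redraw `size` structures (with replacement) using the reevaluated energies."""
    weights = resampling_weights(old_energies, new_energies)
    return rng.choices(structures, weights=weights, k=size)

# Hypothetical example: reevaluation lowers the energy of the second structure.
structures = ["((...))..", "(.....).."]
old = [-10.0, -10.0]
new = [-10.0, -10.0 - RT * math.log(3.0)]  # second structure becomes 3x as likely
weights = resampling_weights(old, new)
redrawn = resample(structures, old, new, size=1000, rng=random.Random(0))
```

When the reevaluated energies equal the original ones, all weights are equal and the redrawn sample follows the original sample.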

Probability Profiling for Predicting Single-Stranded Regions in RNA Secondary
Structure

For single-stranded bases in E. coli tRNA^^, Fig. 18A demonstrates a
probability profile estimated from 1,000 sampled secondary structures according to
the present invention, the probability profile computed by the Vienna RNA package,
the profile indicated by the minimum free energy ("MFE") structure computed with
version 3.1 of mfold (http://www.bioinfo.rpi.edu/applications/mfold/), and the one
indicated by the phylogenetically determined structure. A sample size of 1,000 was
found to be adequate because the profile estimates from this sample and a larger
sample of 10,000 structures were not readily distinguishable. For the unpaired
individual bases, the probability profile and the profile by the MFE structure are
comparable. This is generally expected because the MFE structure is the most
probable structure in the sample. However, the MFE structure substantially
underpredicts the width of the region around nucleotide G^^ of the anticodon loop,
while a significant portion of the sample by the present invention adequately reveals
the width. For the region between nucleotide G^'^ and the sampling approach
and the latest version 1.3.1 of the Vienna RNA package gave comparable results;
however, for the region between nucleotide and C^^ the sampling profile by the
present invention predicted the phylogenetic structure substantially better than the
Vienna profile. The current version of the Vienna package is based on an earlier
compilation of Turner's free energy parameters. It has been shown that the latest
update improves the prediction of secondary structure. This explains the better
performance by the sampling algorithm of the present invention.

Fig. 18B shows the probability profile of a sample for single-stranded
sequences with a sequence width of four nucleotides. For comparison with the
phylogenetic structure, a dot with coordinates (i, 1) is shown in Fig. 18B if the four
nucleotide sequence starting at nucleotide i is single-stranded, and a dot with
coordinates (i, 0) is plotted if any of the four nucleotides is base paired. Similarly,
the MFE structure is plotted. The unstructured region of the anticodon loop is
missed by the MFE structure, but is revealed by the sampling profile through a peak
of substantial probability. For the two sampling profiles in both Fig. 18A and Fig.
18B, not only do the single-stranded regions in the phylogenetic structure
correspond well to the local peaks of the probability profiles, but also the width of
the regions matches the width of the peaks with only one exception, region
^^AUGGCAU^* of the anticodon loop. The peak for this region in the phylogenetic
structure is slightly narrower because two Watson-Crick pairs A^^-U^^ and U^^-A^'
are likely to be predicted by any free-energy-based algorithm, while these two base
pairs are absent in the phylogenetic structure. In Fig. 18B, the peak of the sampling
profile between A^^ and U^' is much lower than the corresponding peak in Fig. 18A
because, while the single-stranded probability for each of G^'', G^^ , and C^^ is over
0.96, the probabilities for U" and A^' are below 0.28. Thus, for identifying a
single-stranded region of at least four nucleotides, a high peak in the profile of
single-stranded bases can be visually misleading when the width of the peak is
smaller than 4 nucleotides. The probability profile of single-stranded sequences
presents a clearer picture of potential antisense sites, because it has fewer and
narrower peaks than the profile of single-stranded bases. This probability profile
cannot be obtained by the Vienna RNA package or other existing computational
methods.
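
The difference between the two kinds of profile can be made concrete with a short sketch. It assumes sampled structures given as dot-bracket strings of equal length (an illustrative representation) and computes, for each start position, the fraction of structures in which all bases of the window are unpaired. The function name and toy sample are hypothetical.

```python
def single_stranded_profile(sample, width=4):
    """For each 0-based start position, the fraction of sampled structures in
    which the `width` consecutive bases starting there are all unpaired."""
    n = len(sample[0])
    window = "." * width
    return [
        sum(1 for s in sample if s[start:start + width] == window) / len(sample)
        for start in range(n - width + 1)
    ]

# Hypothetical toy sample of four structures for an 8-nt sequence.
sample = ["((....))", "(......)", ".(....).", "(((..)))"]

profile_w4 = single_stranded_profile(sample, width=4)  # single-stranded sequences
profile_w1 = single_stranded_profile(sample, width=1)  # single-stranded bases
```

In this toy sample the base at index 3 is unpaired in every structure, yet the four-nucleotide window starting there is fully unpaired in only one structure of four, which is the kind of discrepancy between the two profiles noted above.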

To further illustrate the sampling approach of the present invention,
probability profiles in Figs. 19A-19D are presented for the following representative
RNA sequences with phylogenetically determined secondary structures: Xenopus
laevis oocyte ("Xlo") 5S rRNA, domain II of E. coli 16S rRNA, E. coli RNase P,
and group I intron from 26S rRNA of Tetrahymena thermophila. For these
sequences, phylogenetically determined single-stranded regions correspond to peaks
in the probability profile with near certainty (Figs. 19A to 19D) (Pc in Table 5 of
Fig. 20). On the other hand, peaks with a maximum probability of at least 0.5
almost certainly point to single-stranded regions (Pc2 in Table 5 of Fig. 20); peaks
with a maximum probability between 0.2 and 0.5 have at least a 50% chance of
correctly indicating single-stranded regions (Pc3 in Table 5 of Fig. 20), whereas there is
a far smaller but appreciable chance for peaks with a maximum probability under
0.2 (Pa in Table 5 of Fig. 20) to correctly indicate single-stranded regions. As in
the case of E. coli tRNA"^"*, for all of these RNA sequences, the probability profile
reveals more single-stranded regions in the phylogenetic structure than the MFE
structure (Pi in Table 5 of Fig. 20). The substantial improvement is because the
alternative structures in the sample by the present invention are able to reveal
structural motifs not predicted by the MFE structure. On the other hand, the motifs
in the MFE structure are well reported by the sample because the MFE structure is
the most probable structure in the sample. The improvement is noticeably greater
for E. coli RNase P, which has the highest percentage of nucleotides in pseudoknots, a
motif not allowed by either mfold or the algorithm according to an exemplary
embodiment of the invention.
The results reveal variation in the reliability of prediction among different
RNAs. For free energy minimization for the prediction of RNA secondary structure,
variability in the reliability of prediction for different RNAs has been well
documented. Because the sampling algorithm of the exemplary embodiment of the
invention is also based on free energies, it is not surprising to observe a similar
phenomenon. There is also substantial variability in the maximum probabilities for
the peaks that correspond to single-stranded regions. Similarly, for minimum free
energy prediction of secondary structure, there is variability in the reliability of
predictions for different regions of a sequence. The summary in Table 5 of Fig. 20
indicates that single-stranded regions predicted by high probability peaks are
"well-determined" by the probability profile. In other words, these regions are highly
stable and thus are present with high probability in a sample of probable secondary
structures. For regions of lower stability, their probabilities are either moderate or
low, because alternative structural motifs will be more likely to be present in the
sample. The sampling algorithm of the exemplary embodiment gives a complete
statistical presentation of probable competing alternative structures. Thus, the
probability profile provides a statistical delineation of single-stranded regions with
varying stabilities. Furthermore, by assigning a probabilistic confidence measure in
predictions, new accessible sites can possibly be identified, as illustrated by Fig. 21.

Antisense Application 

The rabbit β-globin mRNA (589 nt, GenBank accession V00879, coding
region 54-497) has been well studied for antisense inhibition of protein synthesis.
An 11-mer and three 17-mers have been used to target rabbit β-globin mRNA in a
wheat germ extract as well as in microinjected Xenopus oocytes. The inhibition of
cell-free translation by eight phosphodiester antisense oligonucleotides ("ASO"s)
targeted to this mRNA has been examined. A combinatorial oligonucleotide array
technique for hybridization assessment of oligonucleotides within a given region has
also been used. For the rabbit β-globin mRNA, an array of 1,938 oligonucleotides,
up to a length of 17 bases, has been used to measure the ASO:mRNA hybridization
potential. These oligonucleotides were complementary to the first 122 bases of the
mRNA. Three oligomers, BG1, BG2, and BG3, were chosen for study by in vitro
translation in wheat germ extract and the RNase H assay.

In an analysis, the results for BG1, BG2, and BG3 are directly compared to
the data from the other two groups, because all these ASOs were studied in cell-free
translation systems and the percentages of translation inhibition were reported
(Table 6 in Fig. 22). The inhibition percentages facilitate a quantitative comparison
and assessment of the correlation between inhibition of cell-free translation and
computational predictions. Qualitative array hybridization data and the
computational predictions were summarized and compared separately (Table 7 in
Fig. 23).

The probability profile with a sequence width of four nucleotides was
computed with a sample of 1,000 secondary structures for the rabbit β-globin
mRNA. The probability profile and the profile by the MFE structure for the region
A'-U"° are shown in Fig. 24, as the ASOs in these studies were targeted to this part
of the mRNA. The target sites on the mRNA, the inhibition effect in cell-free
translation systems in the three studies, and the hybridization potential predicted by
the probability profile are summarized in Table 6 in Fig. 22. For this analysis, the
hybridization potential was assessed as high if, for the target site, there was at least
one peak with probability ≥0.6; the potential was considered moderate for a peak
with probability between 0.3 and 0.6; the potential was low for a site with a
probability under 0.3 of being partly single-stranded. For ASOs in Cazenave et al.,
the inhibition figures for wheat germ extract were estimated from Figures 3 and 7 in
Cazenave et al. The region A^-A'*^ was targeted by five of eight ASOs in
Goodchild et al. There are three high probability sequences in this region: A^-C*,
A^*-U^\ and U^^-A"*^. They explain the predicted high hybridization potential for
β5, β6, β7, β8, and β6+β7. The moderate inhibition by β1 indicates that A* W
alone is not as effective as the other two. One explanation is that the two adjacent
nucleotides, C" and G", are predicted to almost certainly engage in G:C pairing,
and thus they might present a substantial energy barrier for hybridization elongation
by "zippering". The high inhibition by β8 and β6+β7 suggests that an antisense
effect can be enhanced by simultaneously targeting several high potential sites.
Consistent results for BG1, BG2, and BG3 were found in Milner et al. Clear
inconsistency between the predictions by the present invention and the observed
inhibition was found for 17 Glo [113-129] of Cazenave et al., which appears to be
an exception to the rule of thumb of at least four unpaired bases. In the case of an
effective antisense site with fewer than four unpaired bases, the site would not be
predicted by the probability profile with a sequence width of four nucleotides. On
the target site of 17 Glo [113-129], the probabilities of being unpaired for U^^^ and
G are 0.61 and 0.56, respectively, but the probabilities are less than 0.1 for
adjacent bases U^^^ and G^^^. Many other factors could contribute to the poor
prediction in this case, including tertiary interactions, RNA-protein interactions,
and self-folding of the oligomer, which are unaccounted for by the current algorithm.
If low, moderate, and high hybridization potential are associated with
inhibition of 0-19%, 20-39%, and 40-100%, respectively, then for 13 of the 16
ASOs (81%) examined, the hybridization potential revealed by the probability
profile is indicative of the antisense inhibitory effect. For all the ASOs, there is a
significant correlation (P value=0.0147, correlation coefficient=0.597) between the
hybridization potential predicted by the probability profile and the degree of
translation inhibition. For β1-β8, there is a substantially higher correlation (P
value=0.0037, correlation coefficient=0.882). In contrast, Stull et al. found no
significant correlations between observed inhibition and any predictive indices for
β1-β8. For ASOs in Cazenave et al., Stull et al. found a correlation between Dscore,
one of their indices, and inhibition for oligomer concentration at 6 μM, but no
significant correlation for oligomer concentrations below 6 μM. The probability
profile and the MFE structure give comparable predictions of single-stranded
regions. However, without an associated measure of confidence, there is a lack of
correlation between the binary prediction by the MFE structure and the degree of
translation inhibition (P value=0.567, correlation coefficient=0.155). This
exemplifies the observation that there is limited success in using MFE structure for
antisense design. Because the sampling profile provides a statistical measure of
confidence in the predictions, it is not surprising that the profile is found to be
generally indicative of the degree of translation inhibition.
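
The classification used in this comparison can be written down as a small sketch. The thresholds (0.3 and 0.6 for peak probability; 20% and 40% for inhibition) come from the text, with the boundary values assigned to the upper class as an assumption. The function names and the four records are hypothetical and are not the data of Table 6.

```python
def hybridization_potential(peak_probability):
    """Classify a target site by the highest peak probability within it."""
    if peak_probability >= 0.6:
        return "high"
    if peak_probability >= 0.3:
        return "moderate"
    return "low"

def inhibition_class(percent_inhibition):
    """Classify observed translation inhibition (0-19, 20-39, 40-100 percent)."""
    if percent_inhibition >= 40:
        return "high"
    if percent_inhibition >= 20:
        return "moderate"
    return "low"

def agreement(records):
    """Fraction of ASOs whose predicted potential matches the inhibition class."""
    matches = sum(
        1 for peak, inhibition in records
        if hybridization_potential(peak) == inhibition_class(inhibition)
    )
    return matches / len(records)

# Hypothetical (peak probability, percent inhibition) pairs for four ASOs.
records = [(0.85, 70), (0.45, 25), (0.10, 5), (0.65, 15)]
fraction_in_agreement = agreement(records)
```

The same tally applied to the 16 ASOs of the text would correspond to the reported 13 of 16.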

For the hybridization intensity data in Milner et al., there is very good
agreement between the hybridization intensity and the probability profile for regions
C^-C^^ A^^-C^, and G^^-G^^^ (Table 7 in Fig. 23). The hybridization intensity for
region A^*-C^^ is in reasonable agreement with the probability profile. In this
region, the maximum probability of a peak is about 0.1. For a peak with a maximum
probability under 0.2, there is an appreciable chance for the peak to correctly predict
single-stranded regions (Table 5 in Fig. 20). Thus, weak hybridization is possible
for low but appreciable probabilities. For region A^-C", it is an intriguing contrast
that the hybridization data are in disagreement with both the data in Goodchild et al.
and the probability profile, but the probability profile is in good agreement with the
data in Goodchild et al. The length of the oligomers, 20-45 nt in Goodchild et al.
and at most 17 nt in Milner et al., offers an explanation for the conflicting results.
Goodchild et al. indicated that a greater inhibition could be obtained by covering a
longer portion of the mRNA. This is evidenced by the greater inhibition of β8 or a
mixture of β6 and β7 than either β6 or β7 alone (Table 6 in Fig. 22). There are
several sharp peaks in the probability profile for this region. Thus, a plausible
explanation from the profile is that substantially longer ASOs cover more peaks in
this region, and hence enhance the chance of both nucleation and propagation of
duplex formation. Although oligomer length has a positive effect on translation
inhibition in this case, this may not be generally true. It is also noted that the
conclusion of insignificant hybridization by Milner et al. for region A^-C" appears
to be based on the lack of a continuous subregion with detectable hybridization. In
this region, there are two isolated intensity bands in Figure 1 of Milner et al.,
indicating substantial hybridization at sequence positions that were also targeted by
Glo [3-19] of Cazenave et al.

The six oligomers containing bases C^^-C^ (Table 7 and footnote in Fig. 23),
and BG2, β2, Glo [51-67] in Table 6 in Fig. 22 share one common feature on the
profile: a relatively wide, high probability peak between A^ and U^^, with ^AUG^^
being the initiation codon. This suggests that a smooth and relatively wide peak on
the probability profile can be a high potency antisense site because the chance of
hybridization is improved for a wider single-stranded region.
Rational Design of Antisense Oligos

Quantification of Nucleation Potential. Because a predicted site can be
targeted by numerous oligos of the same length, and by many more with varying
length, a quantitative measure of the nucleation potential is necessary for efficient
oligo screening. A sampling-probability-weighted binding energy for measuring the
binding affinity for nucleation, ΔG_nucleation, can be computed to address this issue.
For the targeted sequence of m ribonucleotides 5'-r_1 r_2 ... r_m-3' and the antisense oligo
of m deoxynucleotides 3'-d_1 d_2 ... d_m-5', this quantitative measure of nucleation
potential is computed by

$$\Delta G_{\text{nucleation}} = \Delta G_{\text{initiation}} + \sum_{i=1}^{m-1} P_i \, \Delta G_{\text{stacking}}(i)$$

where ΔG_stacking(i) is the energy for the ith RNA/DNA base pair stack, ΔG_initiation
is the free energy for duplex initiation, and P_i is the probability that r_i is unpaired in
the secondary structure of the target. The thermodynamic parameters for
RNA:DNA hybrid duplexes (Sugimoto et al.) are used in the calculation. The
probability P_i is computed by our RNA structure sampling algorithm.

In the case that the target sequence is completely single-stranded with
certainty, ΔG_nucleation is simply the sum of the initiation energy and the stacking
energies for the RNA:DNA hybrid, because all of the weighting probabilities are 1.
If every base in the target sequence is base paired with certainty, then all the weighting
probabilities are 0, and ΔG_nucleation is ΔG_initiation = 3.1 kcal/mol. In this case,
nucleation is energetically unfavorable. Through the thermodynamic parameters,
GC content is accounted for indirectly in the calculation of
ΔG_nucleation. More importantly, the uncertainty in the prediction of local structure at
the target site is addressed through the weighting probabilities.
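
The two limiting cases above can be checked with a minimal sketch of the probability-weighted sum. The initiation energy of 3.1 kcal/mol is taken from the text. The stacking energies in the example are placeholder numbers chosen for illustration and are not the RNA:DNA parameters of Sugimoto et al. The function name and the pairing of the ith stack with the ith unpaired probability are assumptions.

```python
DG_INITIATION = 3.1  # kcal/mol, duplex initiation energy quoted in the text

def nucleation_energy(unpaired_probs, stacking_energies):
    """Probability-weighted binding energy for an m-nt target site.

    unpaired_probs    -- P_1 .. P_m, probability that each target base is unpaired
    stacking_energies -- the m-1 stacking energies (kcal/mol) of the hybrid duplex
    """
    if len(stacking_energies) != len(unpaired_probs) - 1:
        raise ValueError("expected m-1 stacking energies for m target bases")
    # zip stops after m-1 terms, so the ith stack is weighted by P_i.
    weighted = sum(p * dg for p, dg in zip(unpaired_probs, stacking_energies))
    return DG_INITIATION + weighted

# Placeholder stacking energies for a 4-nt site (illustrative values only).
stacks = [-1.0, -2.0, -1.5]

fully_paired = nucleation_energy([0.0, 0.0, 0.0, 0.0], stacks)    # initiation only
fully_unpaired = nucleation_energy([1.0, 1.0, 1.0, 1.0], stacks)  # full stacking sum
partly_open = nucleation_energy([0.9, 0.5, 0.1, 0.8], stacks)
```

Ranking candidate oligos for one site then amounts to sorting them by this value, with the most negative value indicating the strongest predicted nucleation.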

The results with rabbit β-globin mRNA suggest that relatively wide, high
probability peaks on the probability profile are very likely to be effective antisense
sites. The probability profile approach of the present invention offers a
comprehensive computational screening of the entire mRNA or viral RNA. For
several other mRNA sequences with length ranging from 1 kb to 3 kb, fifteen to
twenty high hybridization sites per kb (data not shown) have been observed. These
sites provide ample opportunities for rational design of antisense oligomers. An
antisense oligomer is the reverse complement of a target sequence. The
identification of optimal oligomers could be particularly important for antisense
drug development. In applications, one can focus on sites within a particular mRNA
region (e.g., coding region) of interest. In designing antisense oligomers, some basic
rules are applicable for avoiding non-antisense effects and for enhancing antisense
potency. Four Gs in a row should be avoided. To minimize the possibility of
binding to a non-targeted mRNA with strong sequence homology at the binding site,
a BLAST search for a prospective oligomer can be performed to ensure no appreciable
overlap with other mRNAs in the experimental system. In particular, investigators
need to be aware that translation initiation sites can have good homology in both
related and non-related genes. To avoid stable intra-molecular structure within
oligomers, oligomers that contain self-complementary regions (i.e., palindromic
sequences) should not be used. Other experimental guidelines may also be used.

Rational Antisense Design. Based on probability profiling, a design
procedure may be adopted for rational selection of antisense oligomers:

1. Computation for the construction of the complete probability profile of the target
RNA.

2. Selection of accessible sites predicted by high probability peaks on the profile.

3. Selection of the antisense oligos (e.g., 20-mers) for each accessible site with the
strongest probability-weighted binding energy calculated with RNA:DNA
stacking energy parameters.

4. Avoidance of three contiguous Gs, a motif known to cause non-specific effects.

5. Performing an alignment search (e.g., BLAST) to avoid significant homology to
other genes in the experimental system.
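
The sequence-level parts of the steps above (deriving the oligo, the G-run rule, and the self-complementarity rule) can be sketched as follows. The function names, the default run length of three taken from step 4, and the minimum palindrome length of six are assumptions for illustration. The profile computation and the BLAST search are outside the scope of this sketch.

```python
RNA_TO_DNA_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "U": "A"}
DNA_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def antisense_dna(target_rna):
    """Antisense DNA oligo (5'->3'): the reverse complement of the RNA target site."""
    return "".join(RNA_TO_DNA_COMPLEMENT[b] for b in reversed(target_rna.upper()))

def has_g_run(oligo, run_length=3):
    """True if the oligo contains `run_length` contiguous Gs."""
    return "G" * run_length in oligo

def has_self_complement(oligo, min_length=6):
    """True if some window of `min_length` bases equals its own reverse complement."""
    for start in range(len(oligo) - min_length + 1):
        window = oligo[start:start + min_length]
        reverse_complement = "".join(DNA_COMPLEMENT[b] for b in reversed(window))
        if window == reverse_complement:
            return True
    return False

def passes_sequence_rules(oligo):
    """Apply the G-run and self-complementarity rules to a candidate oligo."""
    return not has_g_run(oligo) and not has_self_complement(oligo)

candidate = antisense_dna("AUGGCC")
```

Candidates that pass these rules would then be ranked by probability-weighted binding energy and checked for homology.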

Example of antisense design. For E. coli lacZ (GenBank Accession No.
U00096), which codes for β-galactosidase, the complete profile reveals 20 or so
"well-determined" high antisense potential sites per kilobase (Fig. 25). A close-up
examination of any region of the target can be facilitated by a zoomed-in version of
the profile (Fig. 26, for nt 2200 through nt 2400). Ten antisense 20-mers were
selected from the above design steps, and are listed in Table 8 of Fig. 27.
Mutual Accessibility Plot for Predicting RNA:RNA Interaction

For RNA:RNA interactions through antisense binding, e.g., between RNA target and
chemically synthesized or naturally occurring antisense ribonucleic acids (antisense
RNAs), or between RNA target and trans-cleaving ribozymes, the structures of both
RNAs are important. Thus, as illustrated by Fig. 28, antisense binding is largely
dependent on the accessibility of both the bases on the target site and their
complementary bases on the antisense RNA or ribozyme. This mutual accessibility
between two RNAs can be assessed with an overlay plot of probability profiles for
the two RNAs at the target site (Fig. 29). The mutual accessibility plot thus provides
a new tool to address local accessibility of both RNAs at the site of interaction.
Rational Design of Trans-Cleaving Ribozyme

For trans-cleaving ribozymes (e.g., hammerhead or hairpin ribozyme, as
illustrated by Figs. 30A, 30B), the binding by the ribozyme's antisense arm(s) is the
rate-limiting step. Thus, identification of accessible regions on the target is
important for ribozyme design. On the other hand, for a hammerhead ribozyme, the
two binding arms also need to be accessible for interaction with target sequences
flanking the cleavage triplet, e.g., GUC. The mutual accessibility between the target
RNA and a ribozyme can be assessed with an overlaid plot of probability profiles at
the target site (Fig. 29). The structure of the ribozyme is equally important. It has
been an open issue to what extent incorrect ribozyme folds can be tolerated. The
answer to this question may partly depend on the equilibrium between the correct
ribozyme fold and alternatives (Christoffersen et al.). A probabilistic measure of this
equilibrium calculated through classification of sampled structures for the ribozyme
may be a good indicator for appropriateness of the catalytic domain of the
ribozyme.
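
The overlay of two probability profiles can be illustrated with a small sketch. It assumes per-base unpaired probabilities for the target and for a fully complementary binding arm, and pairs each target position with the arm position it would hybridize to in antiparallel orientation. The function names and the cutoff of 0.5 are hypothetical choices for illustration and are not values given in this text.

```python
def mutual_accessibility(target_profile, arm_profile, target_start):
    """Pair each target-site probability with that of its complementary arm base.

    target_profile -- unpaired probabilities along the target RNA (5'->3')
    arm_profile    -- unpaired probabilities along the binding arm (5'->3')
    target_start   -- 0-based index of the first target base of the site
    """
    m = len(arm_profile)
    site = target_profile[target_start:target_start + m]
    partner = arm_profile[::-1]  # antiparallel pairing reverses the arm
    return list(zip(site, partner))

def both_accessible(pairs, cutoff=0.5):
    """Flag positions where target and arm probabilities both reach the cutoff."""
    return [t >= cutoff and a >= cutoff for t, a in pairs]

# Hypothetical profiles for a 5-nt target region and a 3-nt binding arm.
target = [0.1, 0.9, 0.8, 0.2, 0.7]
arm = [0.5, 0.6, 0.4]

overlay = mutual_accessibility(target, arm, target_start=1)
flags = both_accessible(overlay)
```

Plotting the two columns of `overlay` against target position gives an overlay of the kind described for Fig. 29.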

Rational ribozyme design. Based on probability profiling for both the target
RNA and the ribozyme, and statistical folding of the ribozyme and subsequent
structure classification, the following steps may be involved in rational design of
trans-cleaving ribozymes:

1. Computation for the construction of the complete probability profile for the target
RNA.

2. Evaluation of accessibility of both the cleavage site (e.g., GUC for hammerhead
ribozyme) and its flanking sequences.

3. Specification of the bases of the ribozyme binding arms and subsequently the
ribozymes for accessible sites.

4. Computation of the probability profile for each designed ribozyme.

5. Evaluation of accessibility of the ribozyme binding arms.

6. Evaluation of appropriateness of the structure of the catalytic domain of the
ribozyme by structure classification for estimating the equilibrium between correct
fold and alternatives.

7. Evaluation of mutual accessibility between the ribozyme binding arms and their
target sequences.
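
Step 2 of the list above can be sketched as a scan for cleavage triplets followed by a summary of the flanking accessibility. The arm length, the use of the mean unpaired probability as the summary, and the function names are assumptions made for this illustration. The sequence and profile are toy values.

```python
def cleavage_sites(rna, triplet="GUC"):
    """0-based start indices of every occurrence of the cleavage triplet."""
    sites, index = [], rna.find(triplet)
    while index != -1:
        sites.append(index)
        index = rna.find(triplet, index + 1)
    return sites

def flank_accessibility(profile, site, arm_length, triplet_length=3):
    """Mean unpaired probability of the 5' and 3' flanks of a cleavage site."""
    left = profile[max(0, site - arm_length):site]
    right = profile[site + triplet_length:site + triplet_length + arm_length]

    def mean(values):
        return sum(values) / len(values) if values else 0.0

    return mean(left), mean(right)

# Hypothetical 12-nt target and per-base unpaired probabilities.
rna = "AAGUCAAGUCUU"
profile = [0.9, 0.8, 0.1, 0.1, 0.1, 0.7, 0.9, 0.2, 0.2, 0.2, 0.6, 0.4]

sites = cleavage_sites(rna)
left_mean, right_mean = flank_accessibility(profile, sites[1], arm_length=2)
```

Sites for which both flank summaries are high would be carried forward to the design of binding arms in step 3.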

Example of ribozyme design. The flanking sequences of all 23 GUC triplets
for the breast cancer resistance protein (BCRP) mRNA (2418 nt, GenBank
Accession No. AF098951) were analyzed for accessibility by probability profiling.
For five of these sites, both flanking sequences are predicted to be accessible. For
one of the five sites, nt 1896-1898 on the target mRNA, the resulting ribozyme has
good mutual accessibility for both binding arms, as illustrated by Fig. 28B.
Rational Design of siRNAs

A probability-weighted binding energy for the hybridization between the antisense
strand siRNA and its complementary sequence on the target can be computed. The
calculation is the same as the calculation of nucleation potential for antisense oligos
with the only exception that RNA:RNA stacking energy (Xia et al.) is used here for
RNA:RNA hybridization. Coupled with probability profiling for accessibility and
other considerations, a rational selection process of siRNAs may involve the
following steps:

1. Computation for the construction of the complete probability profile of the target
RNA.

2. Selection of accessible sequences (e.g., AA(N19) motifs, where N is any
nucleotide) of desired length (e.g., 21-23 nt) on the target.

3. Computation of probability-weighted binding energy with RNA:RNA stacking
energy parameters for the duplex formed between each selected target
sequence and the antisense strand siRNA.

4. Computation of GC content for selection of target sequences with preferred GC
content (e.g., low to balanced GC).

5. Performing an alignment search (e.g., BLAST) to avoid significant homology to
other genes in the experimental system.
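
The motif search of step 2 and the GC computation of step 4 can be sketched as follows. The function names and the toy sequence are hypothetical; positions are reported as 0-based indices, which is a convention chosen here.

```python
def aa_n19_sites(mrna):
    """Return (start, sequence) for every AA(N19) motif: AA followed by 19 nt.

    Overlapping motifs are reported separately; `start` is 0-based."""
    seq = mrna.upper()
    return [
        (i, seq[i:i + 21])
        for i in range(len(seq) - 20)
        if seq[i:i + 2] == "AA"
    ]

def gc_content(seq):
    """Fraction of G and C bases in a sequence."""
    return sum(1 for base in seq if base in "GC") / len(seq)

# Hypothetical 24-nt toy sequence containing a single AA(N19) motif.
mrna = "CAA" + "GCAU" * 4 + "GCA" + "UU"

sites = aa_n19_sites(mrna)
gc_values = [gc_content(sequence) for _, sequence in sites]
```

The motifs returned here would then be filtered by accessibility on the probability profile before the binding energy is computed in step 3.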

Example of siRNA design. Exon 3 of human estrogen receptor 1 (ESR1,
GenBank Accession No. NM_000125) is the region of interest. The entire 6450 nt
mRNA of ESR1 was folded by the sampling algorithm. There is a total of 470
AA(N19) motifs on the mRNA, including 5' UTR and 3' UTR. The probability
profile for exon 3 is displayed by Fig. 31. There are six AA(N19) motifs within exon
3, and three more with the majority of bases within the exon. Three of the nine target
sequences are predicted to be well accessible (Table 9 in Fig. 32).
Advantages of Present Invention 

Long RNAs may be trapped in locally stable structures. Furthermore, for 
long-chain RNAs, there are many suboptimal foldings with free energies close to the 
minimum free energy. It has been a practical problem for antisense experimentalists
to select one of the low free energy structures as the basis for antisense design. The 
suboptimal foldings from mfold do not guarantee a statistically unbiased sample of 
probable secondary structures. This makes it difficult to assign a statistical measure 
of confidence for predictions based on these suboptimal foldings. It is possible that
each mRNA exists as a population of different structures, and a stochastic approach 
to accessibility evaluation may be appropriate (Christoffersen et al.). By
summarizing a statistical sample of probable structures in a single plot, the 
probability profile approach of the present invention overcomes these difficulties. 
The "well-determined" single-stranded regions are revealed by peaks with high 
probabilities on the profile. Statistical sampling of probable structures provides a 
suitable means to address these long-standing issues. This is demonstrated by the 
substantial improvement in predictions over the minimum free energy structure. 
The sampling method also has the advantage that it does not require the generation 
of a huge number of all possible structures. For antisense nucleic acid design, the 
structure sampling algorithm and probability profiling are better suited to the
evaluation of target accessibility.
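
As an illustration of how a probability profile summarizes a sample in a single curve, the sketch below estimates, for every position, the fraction of sampled structures in which a window of consecutive bases is entirely unpaired. It assumes each sampled structure is given as a list of 0-based base pairs `(i, j)`; the function name and the default window `width` are hypothetical choices for this example.

```python
def single_stranded_profile(n, sample, width=4):
    """profile[s] = fraction of sampled structures in which bases
    s .. s+width-1 of an n-nt sequence are all unpaired."""
    counts = [0] * (n - width + 1)
    for pairs in sample:
        paired = [False] * n
        for i, j in pairs:
            paired[i] = paired[j] = True
        for start in range(n - width + 1):
            if not any(paired[start:start + width]):
                counts[start] += 1
    return [c / len(sample) for c in counts]
```

Peaks in the returned list correspond to the "well-determined" single-stranded regions described above; setting `width=1` gives the per-base profile.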
Functional Genomics

The completion of the sequencing of the human genome signals the dawn of
a new era in biomedical research. Of the estimated 30,000-40,000 genes in the
human genome, definitive functions have been assigned to only a few percent.
Functional genomics is concerned with the determination of biological functions for
all of the genes and their protein products on a genome-wide scale. Inactivation of a
gene is the classical approach to assign a function to a gene in higher organisms. In
the post-genomic era, however, gene knockout and mutagenesis, the traditional
"gold standard" tools, can no longer keep pace with new sequence information
rapidly accumulated from various genome projects. Therefore, antisense nucleic
acids that target mRNA have emerged as attractive reverse genetic tools for high
throughput functional genomics. Recently, the potential of these RNA-targeting
techniques has been demonstrated through the identification of functional genes by
ribozymes in mammalian cells; through chromosome-wide phenotypic screening by
RNAi in C. elegans; and through genome-wide gene functional alterations by an
antisense approach in Candida albicans. The importance of these techniques is
further evidenced by the steady increase in the annual number of antisense (Fig. 5 in
Ding 2002, attached in the Appendix) and ribozyme papers listed in PubMed, and
the recent explosion of RNAi papers in the literature.

Complicated multi-component biological systems can be studied by using
antisense nucleic acids to independently block the synthesis of each individual
protein in the system. Antisense also promises to reveal genetic pathways through
expression arrays. By inhibition of protein expression and target mRNA, and
through the evaluation of inhibitory effects on expression of genes on DNA arrays,
insight will be gained on the gene interaction and regulatory pathways.

Drug Target Validation 

Thousands of new potential therapeutic targets have emerged from human 
genome sequencing. The selection and validation of molecular targets are of 
paramount importance for drug development in the new millennium. Antisense 
nucleic acids are important tools for the validation of human therapeutic targets.

High Throughput Applications

DNA expression arrays, which allow the measurement of gene expression
patterns of tens of thousands of genes in parallel, have emerged as major
high-throughput experimental tools in the post-genomic era. DNA expression arrays can
provide important clues to gene function. Genes with similar expression behavior
are likely to be co-regulated or possibly functionally related.
Indeed, statistical clustering analysis has revealed that gene expression data tend to
organize genes into functional categories. Genes with unknown function can be
assigned tentative functions or a role in a biological process based on the known
function of genes in the same cluster.

Single-nucleotide polymorphisms ("SNPs") promise to propel forward
pharmacogenomics, the emerging field concerned with the dissection of the genetic
basis of disease and therapeutic response. SNPs enable studies of association
between a SNP and risk of a disease or drug response. These associations are
valuable for the identification of candidate genes for disease phenotypes.

The eventual determination of the functions of the candidate genes, and
confirmation of gene functional predictions based on analysis of DNA expression
arrays, will require experimental analysis in a systematic and high throughput
fashion to keep pace with the fast growing genome, expression array, and SNP
databases. Antisense nucleic acids are well suited for this endeavor. Expression
array and SNP databases can provide the basis for high throughput antisense nucleic
acid applications to functional genomics and drug target validation.

Experimental approaches for finding potent antisense nucleic acids are
expensive, time consuming, and laborious, and are usually limited to a region of the
target RNA. Published work suggests that, at the very best, only one in eight
antisense oligonucleotides is effective. To realize the promise of antisense nucleic
acids for high-throughput functional genomics and drug target validation, efficient
screening for identifying accessible sites on the target RNA is necessary. This must
be based on the combination of a high throughput experimental platform and a
rational computational method. For example, for the design of antisense oligos, the
combinatorial RNA:DNA oligonucleotide array technique appears to be an adequate
experimental approach. With labeled transcripts, hybridization intensity can be
measured and visualized. However, there are seemingly two practical limitations.
First, the number of all possible oligomers up to a preset length is huge for an
mRNA. Second, large mRNAs can be hampered by their bulky size from
approaching the oligomers densely distributed on the array surface. Use of selective
oligomers designed by comprehensive computational screening provides a solution.
Hence, in accordance with an embodiment of the invention, a strategy of integrating
computational predictions and experimental techniques such as oligonucleotide
arrays for a rational, efficient, and comprehensive platform for antisense nucleic acid
screening may be used, as shown in Fig. 33.

Folding and Accessibility Prediction for DNA Targets

The focus of the description of the invention has been on RNA targets; however, the
algorithms for prediction of secondary structure and target accessibility can be
straightforwardly applied to DNA targets by using DNA thermodynamic parameters,
such as those summarized by SantaLucia.

Design of Oligonucleotide Probes and Molecular Beacons

The folding and accessibility prediction for either RNA or DNA targets are valuable
for the design of oligonucleotide probes such as molecular beacons for effective
hybridization to the target. Molecular beacons are dual-labeled oligonucleotide
probes that are capable of forming a stem-loop structure in the absence of target
(Tyagi & Kramer). The loop portion of the molecule is a probe sequence that is
complementary to a predetermined sequence in a target nucleic acid. The probes
fluoresce only when they hybridize to their complementary targets. When introduced
into living cells, these probes may enable the origin, movement and fate of mRNAs
to be traced.

Other Applications 

Studies of infectious pathogens. Functional studies of genes and their
products for CDC high priority pathogens are important for biodefense. For
example, for the causative agent of plague, Yersinia pestis, the functions are yet
unknown for 1,066 of the 4,012 protein-coding genes
(http://www.sanger.ac.uk/Projects/Y_pestis/). In contrast to the RNAi silencing
mechanism, which functions only in eukaryotes, antisense oligos and, more recently,
ribozymes have been demonstrated to be effective in bacterial systems. Thus, gene
inhibition by antisense oligos or ribozymes is important for applications to high
priority pathogens for biodefense.

Studies of small regulatory RNAs. Recently, small non-coding RNAs have
gained increasing attention for their broad regulatory functions. In particular,
microRNAs (miRNAs) are single-stranded antisense RNAs of 21-22 nt that are
believed to target 3' untranslated regions for mediating negative post-transcriptional
regulation. For C. elegans, more than 100 miRNAs have been discovered. However,
it is a challenge to identify the particular target for each miRNA. Because the
antisense target site is likely to be largely unstructured, the combination of sequence
alignment and an analysis of accessibility by probability profiling will constitute a
promising strategy for addressing this problem.

Improved structure prediction for homologous RNAs. Improved structure
predictions for homologous RNAs, in particular mRNAs, may be possible by taking
advantage of both the statistical sampling paradigm and the potential conservation in
structure for sequences of related species available from genome sequencing
projects. This will in turn improve the prediction of target accessibility for antisense
nucleic acid design.

Algorithm extensions to permit experimental and deterministic constraints.
Experimental information on secondary structure can be incorporated into an
algorithm to improve predictions by eliminating biochemically invalid structures.
Several types of experimental constraints are: a base is paired (partner unknown), a
forced base pair, an unpaired base, and an unwanted base pair. These constraints can
be extended to consecutive bases or base pairs. Base pairing can also be prohibited
between two regions. These constraints are deterministic, because it is implicitly
assumed that there is no uncertainty in the assignment of base pairs or unpaired
bases. For mathematical algorithms, the constraints can be handled by assigning a
large penalty energy (e.g., an unwanted base pair) or a bonus energy (e.g., a forced
base pair) in the forward recursions. Similarly, free energies may be adjusted in the
calculation of the partition functions to address constraints. The sampling
probabilities are adjusted accordingly, such that sampled structures meet the
constraints. The bonus energy treatment can be a problem, because large bonuses
cause overflows of partition functions. An alternative to assigning a bonus is to
penalize all opposite cases. For a base forced to pair, e.g., a large penalty energy can
be assigned to the cases of the base being unpaired.
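
The overflow concern can be seen directly in floating-point arithmetic. The sketch below is only an illustration: the value of `RT`, the penalty size of 1000 kcal/mol, and the function names are assumptions made for this example. A large bonus makes the Boltzmann factor exp(-ΔG/RT) overflow, whereas a large penalty on the opposite case simply drives its weight to zero.

```python
import math

RT = 0.616  # kcal/mol, approximately R*T at 37 C (assumed value)

def boltzmann_weight(delta_g):
    """exp(-dG/RT); raises OverflowError if the exponent exceeds float range."""
    return math.exp(-delta_g / RT)

def constrained_weight(delta_g, violates_constraint, penalty=1000.0):
    """Weight of a structural element when the opposite case is penalized:
    elements that violate the constraint receive a large positive energy."""
    if violates_constraint:
        delta_g += penalty
    return boltzmann_weight(delta_g)
```

With these definitions, `constrained_weight(-2.0, True)` underflows harmlessly to `0.0`, so violating structures are never sampled, while `boltzmann_weight(-1000.0)` (a bonus of the same size) raises `OverflowError`.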

Algorithm extensions to permit experimental and probabilistic constraints. There is
often variation in the intensity of the reaction in enzymatic or chemical probing.
Weak to very strong enzymatic cuts can be indicated by different levels of intensity
on an electrophoretic gel. This probably reflects some heterogeneity in the RNA
population as a result of transient intra-molecular interactions and molecular
"breathing" of weak base pairs. Another reason for the variability may be the steric
hindrance problem due to the bulkiness of RNases. The variability introduces
uncertainty in the assignment of base pairs or unpaired bases from the reaction data.
A probabilistic approach can address the uncertainty. Assignment of probabilities
has been considered for base pairs using enzymatic digestion data in a heuristic
matrix method for structure modeling (Quigley et al.). Pooling of information from
several reactions by calculating renormalized probabilities has also been
considered. The uncertain base pairs and unpaired bases together with their
probabilities define what are called probabilistic constraints. A two-step method
may be considered to accommodate such constraints. The first step is a "coin flip"
step for simulating deterministic constraints by sampling with the probabilities. The
collection of outcomes defines a set of deterministic constraints. In step two, a
secondary structure is sampled with the algorithm for deterministic constraints. This
two-step process is repeated to generate a sample of structures. An alternative is to
include probabilities and their corresponding deterministic constraints in a single
round of calculation of the partition functions, possibly by a weighting scheme.
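
The two-step method can be sketched as follows. This is a minimal outline under stated assumptions: constraints are represented as arbitrary labels paired with probabilities, and `fold_with_constraints` is a hypothetical callable standing in for the constrained sampling algorithm, which is not reproduced here.

```python
import random

def flip_constraints(prob_constraints, rng):
    """Step one ("coin flip"): keep each uncertain constraint with its
    assigned probability, yielding a set of deterministic constraints."""
    return [c for c, p in prob_constraints if rng.random() < p]

def sample_with_probabilistic_constraints(prob_constraints,
                                          fold_with_constraints,
                                          n_samples, seed=0):
    """Repeat the two-step process to build a sample of structures."""
    rng = random.Random(seed)
    sample = []
    for _ in range(n_samples):
        fixed = flip_constraints(prob_constraints, rng)    # step one
        sample.append(fold_with_constraints(fixed, rng))   # step two
    return sample
```

Because a fresh set of deterministic constraints is drawn for every structure, a constraint with probability 0.8 is enforced in roughly 80% of the sampled structures.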

Algorithm extensions for H-pseudoknot prediction. A set of parameter
estimates for H-pseudoknots, important tertiary structure motifs, has been compiled
(Gultyaev et al.). This parameter set is based on experimentally and/or
phylogenetically proven pseudoknots. An efficient algorithm based on the present
invention for H-pseudoknot prediction may take the following steps:

1. Sample a large number of secondary structures with the statistical sampling 
algorithm. 

2. Identify all hairpins for each sampled structure, and predict H-pseudoknots by
evaluating stabilities with the parameters for H-pseudoknots.

3. Compute the sampling estimates of probabilities of the predicted H-pseudoknots. 
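
The bookkeeping in steps 2 and 3 can be sketched as below. This is an illustrative outline only: the stability evaluation with H-pseudoknot parameters is not reproduced, so hairpins themselves stand in for the predicted motifs, and the function names and the representation of a structure as a list of 0-based pairs `(i, j)` are assumptions.

```python
from collections import Counter

def hairpins(pairs):
    """Closing pairs (i, j) of hairpin loops: pairs that enclose no paired base."""
    paired = {x for p in pairs for x in p}
    return [(i, j) for i, j in sorted(pairs)
            if not any(k in paired for k in range(i + 1, j))]

def motif_probabilities(sample, find_motifs):
    """Sampling estimate: fraction of sampled structures containing each motif."""
    counts = Counter()
    for pairs in sample:
        counts.update(set(find_motifs(pairs)))
    return {motif: c / len(sample) for motif, c in counts.items()}
```

In the full procedure, `find_motifs` would return the H-pseudoknots predicted for a sampled structure rather than its hairpins; the frequency estimate is computed in the same way.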

This procedure evaluates stabilities of potential H-pseudoknots after the
prediction of an unknotted structure. It has several advantages: (1) a sample
simulated by the rigorous sampling algorithm reflects the Boltzmann ensemble of
the secondary structures. The resulting predictions of H-pseudoknots are based on an
unbiased sample of probable alternatives rather than a single optimal or a few
suboptimal structures; (2) the algorithm will be able to incorporate credible free
energy estimates for H-pseudoknots and return probabilities of predicted
H-pseudoknots for an assessment of confidence in the predictions; (3) because of the
fast sampling algorithm, the procedure will be efficient; (4) this approach can be
easily extended to predict more general types of pseudoknots when credible
parameters are available. The extension only requires the identification of all loop
regions in step 2.

Sampling framework for folding of multiple nucleic acids and other types of
biomolecules. The sampling approach disclosed in the invention may be applicable
to folding of multiple nucleic acids and other types of biomolecules such as proteins,
by computing partition functions with energy parameters and sampling molecular
conformations. For example, for two nucleic acid molecules, prediction of folding
may involve the following basic steps:

1. Calculation of joint partition functions of the two molecules using free energy
parameters.

2. Inclusion of molecular concentrations.

3. Sampling of bimolecular conformations using probabilities computed with
calculations in steps 1 and 2.
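
For the simpler single-molecule case, the two ingredients named above, a partition function and sampling by conditional probabilities, can be sketched as follows. This is a toy illustration only: it assumes a made-up energy model in which every allowed base pair contributes a fixed `PAIR_ENERGY` and loops cost nothing, not the full thermodynamic parameters or the joint two-molecule partition functions referred to in the text, and all names are hypothetical.

```python
import math
import random

RT = 0.616            # kcal/mol, approximately R*T at 37 C (assumed)
PAIR_ENERGY = -2.0    # toy energy per base pair, not a measured parameter
MIN_LOOP = 3          # minimum number of unpaired bases in a hairpin loop
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
PAIR_WEIGHT = math.exp(-PAIR_ENERGY / RT)

def q(Q, i, j):
    """Partition function of fragment i..j; an empty fragment has weight 1."""
    return 1.0 if j < i else Q[i][j]

def partition(seq):
    """Forward recursion: base j is unpaired, or pairs with some k in i..j."""
    n = len(seq)
    Q = [[1.0] * n for _ in range(n)]
    for span in range(MIN_LOOP + 1, n):
        for i in range(n - span):
            j = i + span
            total = q(Q, i, j - 1)
            for k in range(i, j - MIN_LOOP):
                if (seq[k], seq[j]) in PAIRS:
                    total += q(Q, i, k - 1) * q(Q, k + 1, j - 1) * PAIR_WEIGHT
            Q[i][j] = total
    return Q

def sample_structure(seq, Q, rng):
    """Random traceback: each case is drawn with probability equal to its
    share of Q[i][j], so structures follow the Boltzmann distribution."""
    pairs, stack = [], [(0, len(seq) - 1)]
    while stack:
        i, j = stack.pop()
        if j - i <= MIN_LOOP:
            continue                      # fragment too short to hold a pair
        r = rng.random() * Q[i][j] - q(Q, i, j - 1)
        if r < 0:
            stack.append((i, j - 1))      # base j is unpaired
            continue
        for k in range(i, j - MIN_LOOP):
            if (seq[k], seq[j]) in PAIRS:
                r -= q(Q, i, k - 1) * q(Q, k + 1, j - 1) * PAIR_WEIGHT
                if r < 0:
                    pairs.append((k, j))  # base j pairs with base k
                    stack += [(i, k - 1), (k + 1, j - 1)]
                    break
        else:
            stack.append((i, j - 1))      # rounding fallback: j unpaired
    return sorted(pairs)
```

Calling `sample_structure` repeatedly with the same `Q` yields an independent sample of any desired size; the partition functions are computed only once.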

A Software for Statistical Folding and Rational Design of Nucleic Acids

Sfold is a suite of statistical nucleic acid folding software. Sfold currently has four
modules with a focus on antisense nucleic acid design: Srna, Soligo, Sribo, and
Sirna. Srna offers general features for statistical RNA folding, and Soligo presents
tools for target accessibility prediction and the rational design of antisense oligos.
Sribo provides both graphical and quantitative tools for target accessibility
prediction and the rational design of trans-cleaving ribozymes. It will allow user
input of ribozyme type (hammerhead or hairpin), preferred cleavage sequence (e.g.,
GUC for hammerhead), target RNA, conserved and variable portions of the
ribozyme, and possibly other information for user-friendly applications. Sirna offers
tools for target accessibility prediction and the rational design of siRNAs for RNA
interference. Furthermore, the tools for antisense accessibility are useful for design
of oligonucleotide probes such as molecular beacons for nucleic acid hybridization.
Version 1.0 of Sfold has been developed, and a Web server for on-line applications
will be located at http://www.wadsworth.org/Sfold and/or
http://www.bioinfo.rpi.edu/applications/Sfold.

It will thus be seen that the objects set forth above, among those made
apparent from the preceding description, are efficiently attained and, because certain
changes may be made in carrying out the above method(s) and in the construction(s)
set forth without departing from the spirit and scope of the invention, it is intended
that all matter contained in the above description and shown in the accompanying
drawings shall be interpreted as illustrative and not in a limiting sense.

The present invention may be described by the following numbered
paragraphs.

1. A statistical algorithm for generating a sample (of any desired size) of
probable secondary structures for a given RNA sequence exactly and rigorously with
Boltzmann equilibrium probabilities of RNA secondary structures comprising the
steps of:

a) calculating partition functions using latest Turner thermodynamics
parameters; and

b) performing random tracebacks using conditional probabilities
computed with partition functions.

2. An extension of the algorithm of paragraph 1 to compute
probabilities of one or more structural motifs with or without constraints for an RNA
molecule comprising the steps of:

a) generation of a sample of probable secondary structures with the
algorithm of paragraph 1;

b) estimation of the probability of a structural motif by using the
observed frequency of the motif in the sample.

3. An extension of the algorithm of paragraphs 1 or 2 wherein said one
or more structural motifs includes one of a helix and a loop.

4. The calculation of Boltzmann-probability-weighted density of states
(BPWDOS) and free energy distributions comprising the steps of:

a) generation of a sample of probable secondary structures with the
algorithm of paragraphs 1, 2 or 3;

b) calculation and display of BPWDOS, the distribution over free
energy intervals for sampled structures (i.e., free energy histogram);

c) calculation and display of the distribution for the probability that
the free energy of a structure is within a threshold of the global
minimum;

d) calculation and display of the distribution for the probability that
the free energy of a structure is within an energy interval.

5. An extension of the algorithm of paragraphs 1, or 2, or 3 to compute
probability profiles of single-stranded bases or single-stranded segments of any
number of bases for a complete statistical delineation of potential antisense
nucleation sites on the entire target RNA comprising the steps of:

a) generating a sample of probable secondary structures with the
algorithm of paragraphs 1, or 2, or 3;

b) estimating the probability that a base or a segment of bases of
specified length is single-stranded by using the observed frequency
in the sample; and

c) repeating above step for all bases or segments on the target RNA
for complete profiles.

6. The calculation of a sampling-probability-weighted free energy (ΔG
nucleation) for measuring the nucleation potential of the hybridization between an
antisense oligo and its target sequence on mRNA and the use of ΔG nucleation and the
probability profiles of paragraphs 1, or 2 or 3 for an automated ranking of all
antisense oligos of any specified length, and consequently an automated computer
selection and design process of antisense oligos. The calculation uses the
probabilities from the profiles as weights in the summation of RNA:DNA
thermodynamic parameters for the hybrid.

7. The use of the algorithm of paragraph 1 and the extension of
paragraph 2 and/or any index or procedure based on the algorithm or the extension
for target prediction, screening and design of antisense oligos for functional
genomics, drug target validation and development of antisense therapeutics.

8. The use of the algorithm of paragraph 1 and the extension of
paragraph 2 and/or 3 and/or any index or procedure based on the algorithm or the
extension for:

a) predicting a potential effective target for a ribozyme of a specified
type (e.g., hammerhead, hairpin) with a specified cleavage site
(e.g., GUC for hammerhead ribozyme);

b) evaluating the accessibility of the substrate-binding arms of the
ribozyme resulting from the predicted target, and the mutual
accessibility between the binding arms and the substrate with the
probability profiles for the ribozyme and the target RNA; and

c) using the designed ribozymes for functional genomics, drug target
validation, and development of ribozymes for human therapeutics.

9. A statistical algorithm for generating a sample (of any desired size)
of probable secondary structures for a given DNA sequence based on any set of
DNA thermodynamics parameters comprising the steps of:

a) calculating partition functions using DNA thermodynamics
parameters;

b) performing random tracebacks using conditional probabilities
computed with partition functions.

10. The use of the algorithm of paragraph 1 and/or 2 and/or 3 and the 
extension of paragraph 4 with RNA or DNA thermodynamics parameters and/or any 
index or procedure based on the algorithm or the extension for the design of
oligonucleotide probes for enhancing signals on nucleic acids hybridization arrays
and thus producing higher quality array data for analysis. 

11. A method of generating a sample of a predetermined number of
probable secondary structures of an RNA sequence, comprising the steps of:

a) generating one or more partition functions of a fragment having
one or more bases of the RNA sequence in accordance with a
predetermined number of thermodynamics parameters; and

b) generating secondary structures based on tracebacks using
conditional probabilities computed with the partition functions.

12. The method of paragraph 11, wherein the thermodynamics
parameters include a predetermined number of free energies for basic structural
elements.

13. The method of paragraph 11, wherein the thermodynamics
parameters include free energies for base pair stacking in a helix.

14. The method of paragraph 11, wherein the partition function
generating step generates partition functions for all fragments of the RNA sequence.

15. A method of generating a complete statistical delineation of potential
antisense nucleation sites on a target RNA, comprising the steps of:

a) generating a sample of one or more probable secondary structures
of an RNA sequence by:

i) generating one or more partition functions of a fragment having
one or more bases of the RNA sequence in accordance with a
predetermined number of thermodynamics parameters, and

ii) generating secondary structures based on tracebacks using
conditional probabilities computed with the partition functions;

b) estimating a probability that a segment of one or more bases on the
target RNA is single-stranded in accordance with an observed
frequency in the sample; and

c) repeating the estimating step for all bases on the target RNA.

16. A method of determining an antisense oligo of a predetermined
length for an antisense nucleation site on a target RNA, comprising the steps of:

a) generating a sample of one or more probable secondary structures
of an RNA sequence by:

i) generating one or more partition functions of a fragment having
one or more bases of the RNA sequence in accordance with a
predetermined number of thermodynamics parameters, and

ii) generating secondary structures based on tracebacks using
conditional probabilities computed with the partition functions;

b) estimating a probability that a segment of one or more bases on the
target RNA is single-stranded by using an observed frequency in
the sample;

c) repeating the estimating step for all bases on the target RNA;

d) identifying a target segment in accordance with the estimated
probabilities;

e) determining a base sequence of the target segment; and

f) determining the antisense oligo in accordance with the base
sequence.

17. A method of evaluating an antisense oligo for a target RNA,
comprising the steps of:

a) generating a sample of one or more probable secondary structures
of an RNA sequence by:

i) generating one or more partition functions of a fragment having
one or more bases of the RNA sequence in accordance with a
predetermined number of thermodynamics parameters, and

ii) generating secondary structures based on tracebacks using
conditional probabilities computed with the partition functions;

b) estimating a probability that a segment of one or more bases on the
target RNA is single-stranded in accordance with an observed
frequency in the sample;

c) repeating the estimating step for all bases on the target RNA;

d) calculating a sampling-probability-weighted free energy for
measuring the nucleation potential of the hybridization between
the antisense oligo and the target RNA; and

e) generating an evaluation indicator for the antisense oligo in
accordance with the sampling-probability-weighted free energy
and the estimated probabilities for the target RNA.

18. The method of paragraph 17, wherein the calculating step includes
applying the estimated probabilities as weights in a summation of RNA:DNA
thermodynamic parameters for the hybrid.

19. A computer program embodied on a computer-readable medium for
generating a sample of a predetermined number of probable secondary structures of
an RNA sequence, comprising:

a) an instruction for generating one or more partition functions of a
fragment having one or more bases of the RNA sequence in
accordance with a predetermined number of thermodynamics
parameters; and

b) an instruction for generating secondary structures based on
tracebacks using conditional probabilities computed with the
partition functions.

20. A computer program embodied on a computer-readable medium for
generating a complete statistical delineation of potential antisense nucleation sites on
a target RNA, comprising:

a) an instruction for generating a sample of one or more probable
secondary structures of an RNA sequence by:

i) generating one or more partition functions of a fragment having
one or more bases of the RNA sequence in accordance with a
predetermined number of thermodynamics parameters, and

ii) generating secondary structures based on tracebacks using
conditional probabilities computed with the partition functions;

b) an instruction for estimating a probability that a segment of one or
more bases on the target RNA is single-stranded in accordance
with an observed frequency in the sample, wherein
the estimating instruction is repeated for all bases on the target RNA.

21. A computer program embodied on a computer-readable medium for
determining an antisense oligo of a predetermined length for an antisense nucleation
site on a target RNA, comprising:

a) an instruction for generating a sample of one or more probable
secondary structures of an RNA sequence by:

i) generating one or more partition functions of a fragment having
one or more bases of the RNA sequence in accordance with a
predetermined number of thermodynamics parameters, and

ii) generating secondary structures based on tracebacks using
conditional probabilities computed with the partition functions;

b) an instruction for estimating a probability that a segment of one or
more bases on the target RNA is single-stranded by using an
observed frequency in the sample, said estimating instruction
being repeated for all bases on the target RNA;

d) an instruction for identifying a target segment in accordance with
the estimated probabilities;

e) an instruction for determining a base sequence of the target
segment; and

f) an instruction for determining the antisense oligo in accordance
with the base sequence.

22. A computer program embodied on a computer-readable medium for
evaluating an antisense oligo for a target RNA, comprising:

a) an instruction for generating a sample of one or more probable
secondary structures of an RNA sequence by:

i) generating one or more partition functions of a fragment having
one or more bases of the RNA sequence in accordance with a
predetermined number of thermodynamics parameters, and

ii) generating secondary structures based on tracebacks using
conditional probabilities computed with the partition functions;

b) an instruction for estimating a probability that a segment of one or
more bases on the target RNA is single-stranded in accordance
with an observed frequency in the sample, said estimating
instruction being repeated for all bases on the target RNA;

d) an instruction for calculating a sampling-probability-weighted free
energy for measuring the nucleation potential of the hybridization
between the antisense oligo and the target RNA; and

e) an instruction for generating an evaluation indicator for the
antisense oligo in accordance with the
sampling-probability-weighted free energy and the estimated
probabilities for the target RNA.

23 . A process embodied in an instiruction signal of a confuting device 

for generating a sample of a predetermined number of probable secondary stiructiures 

25 of an RNA sequence, comprising: 

a) an instruction for generating one or more partition functions of a 

fragmait having one or more bases of the RNA sequence in 

accordance with a predetermined number of thermodynamics 

parameters; and 

30 b) an instruction for generating secondary stiiictures based on 

tracebacks using conditional probabilities computed with the 
partition fimctions. 
24. A process embodied in an instruction signal of a computing device
for generating a complete statistical delineation of potential antisense nucleation
sites on a target RNA, comprising:

a) an instruction for generating a sample of one or more probable
secondary structures of an RNA sequence by:

i) generating one or more partition functions of a fragment having
one or more bases of the RNA sequence in accordance with a
predetermined number of thermodynamics parameters, and

ii) generating secondary structures based on tracebacks using
conditional probabilities computed with the partition functions;

b) an instruction for estimating a probability that a segment of one or
more bases on the target RNA is single-stranded in accordance
with an observed frequency in the sample, wherein
the estimating instruction is repeated for all bases on the target RNA.
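The estimating step above amounts to counting, for each segment, how often it is entirely unpaired in the sample. A minimal sketch follows, assuming the sampled structures are given as dot-bracket strings of equal length (a representation chosen here for illustration, in which "." marks an unpaired base). The function name `unpaired_profile` and the `width` parameter are hypothetical.

```python
def unpaired_profile(structures, width=1):
    """Return, for each start position, the fraction of sampled structures
    in which all `width` consecutive bases from that position are unpaired."""
    if not structures:
        raise ValueError("need at least one sampled structure")
    n = len(structures[0])
    counts = [0] * (n - width + 1)
    for s in structures:
        for i in range(n - width + 1):
            if all(c == "." for c in s[i:i + width]):
                counts[i] += 1
    return [c / len(structures) for c in counts]


# Toy sample of three structures for a 7-base sequence.
sample = ["((...))", ".(...).", "......."]
profile_1 = unpaired_profile(sample, width=1)  # per-base profile
profile_4 = unpaired_profile(sample, width=4)  # profile for 4-base segments
```

Peaks in such a profile would correspond to the candidate accessible sites referred to in the later paragraphs.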
25. A process embodied in an instruction signal of a computing device
for determining an antisense oligo of a predetermined length for an antisense
nucleation site on a target RNA, comprising:

a) an instruction for generating a sample of one or more probable
secondary structures of an RNA sequence by:
i) generating one or more partition functions of a fragment having
one or more bases of the RNA sequence in accordance with a
predetermined number of thermodynamics parameters, and
ii) generating secondary structures based on tracebacks using
conditional probabilities computed with the partition functions;
b) an instruction for estimating a probability that a segment of one or
more bases on the target RNA is single-stranded by using an
observed frequency in the sample, said estimating instruction
being repeated for all bases on the target RNA;

d) an instruction for identifying a target segment in accordance with
the estimated probabilities;

e) an instruction for determining a base sequence of the target
segment; and

f) an instruction for determining the antisense oligo in accordance
with the base sequence.
26. A process embodied in an instruction signal of a computing device
for evaluating an antisense oligo for a target RNA, comprising:
a) an instruction for generating a sample of one or more probable
secondary structures of an RNA sequence by:
i) generating one or more partition functions of a fragment having
one or more bases of the RNA sequence in accordance with a
predetermined number of thermodynamics parameters, and
ii) generating secondary structures based on tracebacks using
conditional probabilities computed with the partition functions;
b) an instruction for estimating a probability that a segment of one or
more bases on the target RNA is single-stranded in accordance
with an observed frequency in the sample, said estimating
instruction being repeated for all bases on the target RNA;

d) an instruction for calculating a sampling-probability-weighted free
energy for measuring the nucleation potential of the hybridization
between the antisense oligo and the target RNA; and

e) an instruction for generating an evaluation indicator for the
antisense oligo in accordance with the sampling-probability-weighted
free energy and the estimated probabilities for the target RNA.
27. A method for the representation and characterization of the
Boltzmann ensemble of RNA secondary structures, comprising the steps of:

a) generation of a sample of probable secondary structures with the
algorithm of paragraph 1;

b) classification of the sampled structures into classes of similar structures;

c) calculation of the probability for each of the classes using the frequency of
the class in the sample;

d) display of a class by two-dimensional or equivalent three-dimensional
plot for the frequency of base pairs in the class; and

e) computation of the Boltzmann probability of the most probable structure
(i.e., the structure with the lowest free energy) in a class as the class
representative.

28. A method for the representation and characterization of the
Boltzmann ensemble of RNA secondary structures, comprising the steps of:

a) generation of a sample of probable secondary structures with the
algorithm of paragraph 1;

b) classification of the sampled structures into classes of similar
structures;

c) calculation of the probability for each of the classes using the
frequency of the class in the sample;

d) display of a class by two-dimensional or equivalent three-
dimensional plot for the frequency of base pairs in the class;

e) computation of the Boltzmann probability of the most probable
structure (i.e., the structure with the lowest free energy) in a class as
the class representative.
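Steps b) and c) above can be sketched as follows. The text does not fix a similarity measure or a clustering procedure, so this example assumes one possible choice: structures are compared by the number of base pairs present in one but not the other, and each structure joins the first class whose founding member lies within `max_distance` of it. The names `base_pairs` and `classify`, the threshold, and the dot-bracket input are all illustrative assumptions.

```python
def base_pairs(structure):
    """Return the set of (i, j) base pairs encoded by a dot-bracket string."""
    stack, pairs = [], set()
    for i, c in enumerate(structure):
        if c == "(":
            stack.append(i)
        elif c == ")":
            pairs.add((stack.pop(), i))
    return frozenset(pairs)


def classify(structures, max_distance=2):
    """Greedy grouping: a structure joins the first class whose founder differs
    from it by at most `max_distance` base pairs, else it founds a new class.
    Returns a list of (members, estimated class probability)."""
    classes = []  # each entry: (founder base-pair set, list of member structures)
    for s in structures:
        bp = base_pairs(s)
        for founder, members in classes:
            if len(founder ^ bp) <= max_distance:  # symmetric difference size
                members.append(s)
                break
        else:
            classes.append((bp, [s]))
    total = len(structures)
    return [(members, len(members) / total) for _, members in classes]


result = classify(["((...))", "((...))", ".(...).", "......."], max_distance=1)
```

The class probability is simply the class frequency in the sample, as in step c). Selecting the lowest free energy member as the representative, as in step e), would require an energy function that is outside this sketch.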

29. A method for generating a mutual accessibility plot for evaluating the
potential of RNA:RNA interaction, comprising the steps of:

a) generating a probability profile with the algorithm in paragraph 5 for
RNA molecule A;

b) generating a probability profile with the algorithm in paragraph 5 for
RNA molecule B;

c) overlay of the portions of the profiles in a sense:antisense
orientation for the region of potential interaction where RNA
molecule A and RNA molecule B have complementary bases.
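The overlay in step c) can be pictured as aligning the two profiles antiparallel over the complementary region. In the sketch below, the function name `overlay`, the 0-based start indices, and the tuple layout are assumptions made for illustration; the two profiles are taken as lists of single-strandedness probabilities, one value per base.

```python
def overlay(profile_a, start_a, profile_b, start_b, length):
    """Pair position start_a + t on molecule A with position
    start_b + length - 1 - t on molecule B (antiparallel alignment).
    Returns rows of (index_a, prob_a, index_b, prob_b) ready for plotting."""
    rows = []
    for t in range(length):
        ia = start_a + t
        ib = start_b + length - 1 - t
        rows.append((ia, profile_a[ia], ib, profile_b[ib]))
    return rows


rows = overlay([0.1, 0.2, 0.3, 0.4], 1, [0.9, 0.8, 0.7], 0, 3)
```

Plotting the two probability columns of `rows` against the position on molecule A would give one form of the mutual accessibility plot.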

30. A method for target accessibility prediction and the rational design of 
antisense oligos, comprising the steps of: 

a) computation for the construction of the complete probability
profile of the target RNA with the algorithm in paragraph 5;

b) selection of accessible sites predicted by high probability peaks on
the profile;

c) selection of the antisense oligo of preferred length (e.g., 20 bases)
for each accessible site with the strongest probability-weighted-binding
energy calculated with RNA/DNA stacking energy
parameters;

d) avoidance of three contiguous Gs, a motif known to cause
non-specific effects;

e) performing alignment search (e.g., BLAST) to avoid significant
homology to other genes in the experimental system.
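Two of the steps above lend themselves to a short illustration: deriving the antisense oligo for a chosen target segment and screening it for three contiguous Gs. The sketch assumes a DNA oligo written 5' to 3' as the reverse complement of the RNA target segment, and it assumes the G-run check is applied to the oligo itself. The names `antisense_oligo` and `has_g_run` are hypothetical, and the alignment search of step e) is not covered.

```python
# RNA target base -> complementary DNA base in the antisense oligo
COMPLEMENT = {"A": "T", "U": "A", "G": "C", "C": "G"}


def antisense_oligo(target_rna, start, length=20):
    """Reverse complement (as DNA, 5'->3') of target_rna[start:start+length]."""
    segment = target_rna[start:start + length]
    return "".join(COMPLEMENT[base] for base in reversed(segment))


def has_g_run(oligo, run=3):
    """True if the oligo contains `run` or more contiguous Gs."""
    return "G" * run in oligo


candidate = antisense_oligo("AUGCCC", 0, length=6)
rejected = has_g_run(candidate)
```

A candidate flagged by `has_g_run` would be dropped before the homology search.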
31. A method for target accessibility prediction and the rational design of
trans-cleaving ribozymes, comprising the steps of:

a) computation for the construction of the complete probability
profile for the target RNA with the algorithm in paragraph 5;

b) evaluation of accessibility of both the cleavage site (e.g., GUC for
hammerhead ribozyme) and its flanking sequences;

c) specification of the bases of the ribozyme binding arms and
subsequently the ribozymes for accessible sites;

d) computation of the probability profile for each designed ribozyme
with the algorithm in paragraph 5;

e) evaluation of accessibility of the ribozyme binding arms;
f) evaluation of appropriateness of the structure of the catalytic
domain of the ribozyme by structure classification for estimating the
equilibrium between correct fold and alternatives;
g) evaluation of mutual accessibility between the ribozyme binding
arms and their target sequences with the method in paragraph 29.
32. A method for target accessibility prediction and the rational design of
siRNAs, comprising the steps of:

a) computation for the construction of the complete probability
profile of the target RNA with the algorithm in paragraph 5;

b) selection of accessible sequence (e.g., AA(N19) motifs, where N
is any nucleotide) of desired length (e.g., 21-23 nt) on the target;

c) computation of probability-weighted-binding energy using the
algorithm in paragraph 7 with RNA:DNA thermodynamic parameters
replaced by RNA:RNA stacking energy parameters for the duplex
formed between each selected target sequence and the antisense
strand siRNA;

d) computation of GC content for selection of target sequences with
preferred GC content (e.g., low to balanced GC);

e) performing alignment search (e.g., BLAST) to avoid significant
homology to other genes in the experimental system.
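The motif scan of step b) and the GC computation of step d) can be sketched briefly. The example assumes an RNA target written with the letters A, C, G and U, and it reports every AA(N19) window, including overlapping ones. The function names `aa_n19_sites` and `gc_content` are illustrative, and no accessibility filtering is applied here.

```python
import re


def aa_n19_sites(target_rna):
    """Return (start, site) for every 21-nt AA(N19) window on the target.
    A lookahead is used so that overlapping windows are all reported."""
    return [(m.start(), m.group(1))
            for m in re.finditer(r"(?=(AA[ACGU]{19}))", target_rna)]


def gc_content(seq):
    """Fraction of G and C bases in the sequence."""
    return sum(base in "GC" for base in seq) / len(seq)


target = "GG" + "AA" + "GCAU" * 4 + "GCA" + "UU"
sites = aa_n19_sites(target)
gc_values = [(start, gc_content(site)) for start, site in sites]
```

Sites could then be ranked by combining `gc_values` with the probability profile of step a).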

33. Frameworks based on the algorithms in paragraphs 1 and/or 5 for
applications to studies of infectious pathogens for biodefence, studies of small
regulatory RNAs, improved structure prediction for homologous RNAs, algorithm
extensions to permit experimental constraints and to predict H-pseudoknots, folding
prediction of multiple nucleic acids and other types of biomolecules such as
proteins.

34. A software named Sfold for statistical nucleic acid folding, and for
target accessibility prediction and the rational design of antisense oligos,
trans-cleaving ribozymes, siRNAs and other RNA-targeting molecules, and design of
oligonucleotide probes such as molecular beacons.

35. A computer program embodied on a computer-readable medium for
target accessibility prediction and the rational design of antisense oligos,
comprising:

a) an instruction for computation for the construction of the complete
probability profile of the target RNA with the algorithm in paragraph
5;

b) an instruction for selection of accessible sites predicted by high
probability peaks on the profile;

c) an instruction for selection of the antisense oligo of preferred
length (e.g., 20 bases) for each accessible site with the strongest
probability-weighted-binding energy calculated with RNA:DNA
stacking energy parameters;

d) an instruction for avoidance of three contiguous Gs, a motif known
to cause non-specific effects;








e) an instruction for performing alignment search (e.g., BLAST) to
avoid significant homology to other genes in the experimental 
system. 

36. A computer program embodied on a computer-readable medium for
target accessibility prediction and the rational design of trans-cleaving ribozymes,
comprising:

a) an instruction for computation for the construction of the complete
probability profile for the target RNA with the algorithm in paragraph
5;

b) an instruction for evaluation of accessibility of both the cleavage
site (e.g., GUC for hammerhead ribozyme) and its flanking
sequences;

c) an instruction for specification of the bases of the ribozyme
binding arms and subsequently the ribozymes for accessible sites;
d) an instruction for computation of the probability profile for each
designed ribozyme with the algorithm in paragraph 5;

e) an instruction for evaluation of accessibility of the ribozyme
binding arms;

f) an instruction for evaluation of appropriateness of the structure of
the catalytic domain of the ribozyme by structure classification for
estimating the equilibrium between correct fold and alternatives;

g) an instruction for evaluation of mutual accessibility between the
ribozyme binding arms and their target sequences with the method in
paragraph 29.

37. A computer program embodied on a computer-readable medium for
target accessibility prediction and the rational design of siRNAs, comprising:

a) an instruction for computation for the construction of the complete
probability profile of the target RNA with the algorithm in paragraph
5;

b) an instruction for selection of accessible sequence (e.g., AA(N19)
motifs, where N is any nucleotide) of desired length (e.g., 21-23 nt)
on the target;

c) an instruction for computation of probability-weighted-binding
energy using the algorithm in paragraph 6 with RNA:DNA
thermodynamic parameters replaced by RNA:RNA stacking energy
parameters, for the duplex formed between each selected target
sequence and the antisense strand siRNA;

d) an instruction for computation of GC content for selection of target
sequences with preferred GC content (e.g., low to balanced GC);

e) an instruction for performing alignment search (e.g., BLAST) to
avoid significant homology to other genes in the experimental
system.

38. A process embodied in an instruction signal of a computing device
for target accessibility prediction and the rational design of antisense oligos,
comprising:

a) an instruction for computation for the construction of the complete
probability profile of the target RNA with the algorithm in paragraph
5;

b) an instruction for selection of accessible sites predicted by high
probability peaks on the profile;

c) an instruction for selection of the antisense oligo of preferred
length (e.g., 20 bases) for each accessible site with the strongest
probability-weighted-binding energy calculated with RNA:DNA
stacking energy parameters;

d) an instruction for avoidance of three contiguous Gs, a motif known
to cause non-specific effects;

e) an instruction for performing alignment search (e.g., BLAST) to
avoid significant homology to other genes in the experimental
system.

39. A process embodied in an instruction signal of a computing device
for target accessibility prediction and the rational design of trans-cleaving
ribozymes, comprising:

a) an instruction for computation for the construction of the complete
probability profile for the target RNA with the algorithm in paragraph
5;

b) an instruction for evaluation of accessibility of both the cleavage
site (e.g., GUC for hammerhead ribozyme) and its flanking
sequences;

c) an instruction for specification of the bases of the ribozyme
binding arms and subsequently the ribozymes for accessible sites;

d) an instruction for computation of the probability profile for each
designed ribozyme with the algorithm in paragraph 5;

e) an instruction for evaluation of accessibility of the ribozyme
binding arms;

f) an instruction for evaluation of appropriateness of the structure of
the catalytic domain of the ribozyme by structure classification for
estimating the equilibrium between correct fold and alternatives;

g) an instruction for evaluation of mutual accessibility between the
ribozyme binding arms and their target sequences with the method in
paragraph 29.

40. A process embodied in an instruction signal of a computing device
for target accessibility prediction and the rational design of siRNAs, comprising:

a) an instruction for computation for the construction of the complete
probability profile of the target RNA with the algorithm in paragraph
5;

b) an instruction for selection of accessible sequence (e.g., AA(N19)
motifs, where N is any nucleotide) of desired length (e.g., 21-23 nt)
on the target;

c) an instruction for computation of probability-weighted-binding
energy using the algorithm in paragraph 6 with RNA:DNA
thermodynamic parameters replaced by RNA:RNA stacking energy
parameters, for the duplex formed between each selected target
sequence and the antisense strand siRNA;








d) an instruction for computation of GC content for selection of target
sequences with preferred GC content (e.g., low to balanced GC);

e) an instruction for performing alignment search (e.g., BLAST) to
avoid significant homology to other genes in the experimental
system.

41. The calculation of a sampling-probability-weighted binding energy
(ΔG_nucleation) for measuring the nucleation potential of the hybridization between an
antisense oligo and its target sequence on RNA. The calculation uses the
probabilities on the profiles from paragraph 6 as weights in the summation of
RNA:DNA thermodynamic parameters for the hybrid.
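The weighted summation described above can be sketched as follows, under stated assumptions. Each dinucleotide stack of the target segment is assumed to contribute its stacking energy multiplied by the estimated probability that both bases of the stack are single-stranded (a width-2 profile). The function name `weighted_binding_energy`, the argument layout, and above all the energy values in the example are hypothetical placeholders, not published RNA:DNA parameters.

```python
def weighted_binding_energy(segment, start, dinuc_profile, stack_energy):
    """Sum stacking energies over the segment, each weighted by the probability
    that the corresponding two adjacent target bases are unpaired.

    segment       -- target RNA bases covered by the oligo
    start         -- 0-based position of the segment on the target
    dinuc_profile -- dinuc_profile[i] = P(bases i and i+1 both unpaired)
    stack_energy  -- dict mapping a dinucleotide such as "GC" to a free energy
    """
    total = 0.0
    for t in range(len(segment) - 1):
        stack = segment[t:t + 2]
        total += dinuc_profile[start + t] * stack_energy[stack]
    return total


# Placeholder numbers for illustration only (not real thermodynamic parameters).
toy_energy = {"GC": -2.0, "CA": -1.0}
toy_profile = [0.5, 1.0]
dg = weighted_binding_energy("GCA", 0, toy_profile, toy_energy)
```

With these toy inputs a more negative result indicates a stronger weighted binding energy, so a fully accessible segment contributes its full stacking energy while a mostly paired one contributes little.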

42. The use of the algorithm of paragraph 1 and the extension of 
paragraph 2 and/or any index or procedure based on the algorithm or the extension 
for target prediction, screening and design of antisense nucleic acids for functional
genomics, drug target validation and development of RNA-targeting therapeutics. 

The invention further comprehends the transmission of information, e.g.,
antisense or ribozyme or siRNA information, target prediction information,
information from screening and/or design of antisense nucleic acids, e.g., as to
functional genomics, drug target validation and development of RNA-targeting
therapeutics, information on the design of oligonucleotide probes (e.g., molecular
beacons), for instance for enhancing signals on nucleic acids hybridization arrays
and thus producing higher quality array data for analysis, from any of the herein
methods, algorithms, or applications thereof; for example, transmission via a global
communications network or the internet, e.g., via Web site posting, such as by
subscription or select or secure access thereto and/or via email and/or via telephone,
IR, radio or television or other frequency signal, and/or via electronic signals over
cable and/or satellite transmission and/or via transmission of disks, cds, computers,
hard drives, or other apparatus containing the information in electronic form, and/or
transmission of written forms of the information, e.g., via facsimile transmission and
the like. Thus, the invention comprehends a user performing methods or using
algorithms according to the invention and transmitting information therefrom; for
instance, to one or more parties who then further utilize some or all of the data or
information, e.g., in the manufacture of products, such as therapeutics, antisense
nucleic acids, probes, assays, etc. The invention also comprehends disks, cds,
computers, or other apparatus or means for storing or receiving or transmitting data
or information containing information from methods and/or use of algorithms of the
invention.

Thus, the invention comprehends a method for transmitting information
comprising performing a method as discussed herein and transmitting a result
thereof.

The invention also comprehends a method for target prediction, or for
screening or designing of antisense oligos, trans-cleaving ribozymes or siRNAs, or
for performing functional genomics, or for drug target validation, or for
development of antisense therapeutics, or for the design of oligonucleotide probes
(e.g., molecular beacons), or for enhancing signals on nucleic acids hybridization
arrays, or for producing higher quality array data, comprising performing a method
as herein discussed or using the algorithm as herein discussed. A result or results
from the method or use of the algorithm may be correlated to target prediction, or
screening or designing of antisense nucleic acids, or performing functional
genomics, or drug target validation, or development of RNA-targeting therapeutics,
or the design of oligonucleotide probes (e.g., molecular beacons), or enhancing
signals on nucleic acids hybridization arrays, or producing higher quality array data.
The invention further comprehends a method for transmitting information for
target prediction, or for screening or designing of antisense nucleic acids, or for
performing functional genomics, or for drug target validation, or for development of
antisense nucleic acids as therapeutics, or for the design of oligonucleotide probes
(e.g., molecular beacons), or for enhancing signals on nucleic acids hybridization
arrays, or for producing higher quality array data, comprising performing a method
as herein discussed or using the algorithm as herein discussed, and transmitting a
result thereof. A result or results may be correlated to target prediction, or screening
or designing of antisense nucleic acids, or performing functional genomics, or drug
target validation, or development of RNA-targeting therapeutics, or the design of
oligonucleotide probes (e.g., molecular beacons), or enhancing signals on nucleic
acids hybridization arrays, or producing higher quality array data. Advantageously
information transmission is via electronic means, e.g., via email or the internet.
Further still, the invention comprehends methods of doing business
comprising performing some or all of a herein method or use of a herein algorithm,
and communicating or transmitting or divulging a result or the results thereof,
advantageously in exchange for compensation, e.g., a fee. Advantageously the
communicating, transmitting or divulging is via electronic means, e.g., via internet
or email, or by any other transmission means herein discussed.

Thus, a first party, "client", can request information, e.g., via any of the
herein mentioned transmission means - either previously prepared information or
information specially ordered as to a particular nucleic acid molecule - such as, for
example, for or on target prediction or for or on identification of accessible sites on
target RNA for gene down-regulation, or for or on identification of single-stranded
regions in the secondary structure of a nucleic acid molecule, or for or on screening
or designing of antisense oligos or trans-ribozymes or siRNAs, or for or on
performing functional genomics, or for or on drug target validation, or for or on
development of RNA-targeting therapeutics, or for or on the design of
oligonucleotide probes (e.g., molecular beacons), or for or on enhancing signals on
nucleic acids hybridization arrays, or for or on producing higher quality array data,
of a second party, "vendor", e.g., requesting information via electronic means such
as via internet (for instance request typed into website) or via email, and the vendor
can transmit that information, e.g., via any of the transmission means herein
mentioned, advantageously via electronic means, such as internet (for instance
secure or subscription or select access website) or email: the information can come
from performing some or all of a herein method or use of a herein algorithm in
response to the request, or from performing some or all of a herein method or use of
a herein algorithm, and generating a library of information from performing some or
all of a herein method or use of a herein algorithm and meeting the request can then
be allowing the client access to the library or selecting data from the library that is
responsive to the request.

Accordingly, the invention even further comprehends collections of
information, e.g., in electronic form (such as forms of transmission discussed
above), from performing a herein method using a herein or portion thereof or using a
herein algorithm or performing some or all of a herein method or use of a herein
algorithm.

And the invention comprehends linked or networked computers sharing
and/or transmitting information from performing a herein method using a herein or
portion thereof or using a herein algorithm or performing some or all of a herein
method or use of a herein algorithm, such as a server or host computer containing
such information and computer or computers, either on the same premises as the
server or host computer or remotely situated accessing that information, whereby
"transmission" can include the linking of such computers and the access to the
information by the remote computer.

* * * 

It will thus be seen that the objects set forth above, among those made
apparent from the preceding description, are efficiently attained and, because certain
changes may be made in carrying out the above method(s) and in the construction(s)
set forth without departing from the spirit and scope of the invention, it is intended
that all matter contained in the above description and shown in the accompanying
drawings shall be interpreted as illustrative and not in a limiting sense.
WHAT IS CLAIMED IS:

1. A method of generating a sample of a predetermined number of
probable secondary structures of an RNA sequence, comprising the steps of:

a) generating one or more partition functions of a fragment having one or
more bases of the RNA sequence in accordance with a predetermined
number of thermodynamics parameters; and

b) generating secondary structures based on tracebacks using conditional
probabilities computed with the partition function.

2. The method of claim 1, wherein the thermodynamics parameters
include a predetermined number of free energies for basic structural elements.

3. The method of claim 1, wherein the thermodynamics parameters
include free energies for base pair stacking in a helix.

4. The method of claim 1, wherein the partition function generating step
generates partition functions for all fragments of the RNA sequence.

5. A method of generating a probability profile for predicting an
accessible site on a target RNA for interaction with a biomolecule, comprising the
steps of:

a) generating a sample of one or more probable secondary structures of an
RNA sequence by:

i) generating one or more partition functions of a fragment
having one or more bases of the RNA sequence in accordance
with a predetermined number of thermodynamics parameters,
and

ii) generating secondary structures based on tracebacks using
conditional probabilities computed with the partition
functions;

b) estimating a probability that a segment of one or more bases on the target
RNA is single-stranded in accordance with an observed frequency in the
sample; and

c) repeating the estimating step for all segments on the target RNA.
6. A method of determining an antisense oligo of a predetermined
length for an antisense nucleation site on a target RNA, comprising the steps of:

a) generating a sample of one or more probable secondary structures of an
RNA sequence by:

i) generating one or more partition functions of a fragment
having one or more bases of the RNA sequence in accordance
with a predetermined number of thermodynamics parameters,
and

ii) generating secondary structures based on tracebacks using
conditional probabilities computed with the partition
functions;

b) estimating a probability that a segment of one or more bases on the target
RNA is single-stranded by using an observed frequency in the sample;

c) repeating the estimating step for all segments on the target RNA;
d) identifying a target segment in accordance with the estimated
probabilities;

e) determining a base sequence of the target segment; and

f) determining the antisense oligo in accordance with the base sequence.

7. A method of evaluating an antisense oligo for a target RNA,
comprising the steps of:

a) generating a sample of one or more probable secondary structures of an
RNA sequence by:

i) generating one or more partition functions of a fragment
having one or more bases of the RNA sequence in accordance
with a predetermined number of thermodynamics parameters,
and

ii) generating secondary structures based on tracebacks using
conditional probabilities computed with the partition
functions;








b) estimating a probability that a segment of one or more bases on the target 
RNA is single-stranded in accordance with an observed frequency in the 
sample; and 

c) repeating the estimating step for all segments on the target RNA;

d) calculating a sampling-probability-weighted binding energy for measuring
a nucleation potential of a hybridization between the antisense oligo and the
target RNA; and

e) generating an evaluation indicator for the antisense oligo in accordance
with the sampling-probability-weighted binding energy and the estimated
probabilities for the target RNA.

8. The method of claim 7, wherein the calculating step includes 
applying the estimated probabilities as weights in a summation of RNA:DNA
thermodynamic parameters for the hybrid. 

9. A computer program embodied on a computer-readable medium for
generating a sample of a predetermined number of probable secondary structures of
an RNA sequence, comprising:

a) an instruction for generating one or more partition functions of a fragment
having one or more bases of the RNA sequence in accordance with a
predetermined number of thermodynamics parameters; and

b) an instruction for generating secondary structures based on tracebacks
using conditional probabilities computed with the partition function.

10. A computer program embodied on a computer-readable medium for
generating a probability profile for predicting an accessible site on a target RNA for
interaction with a biomolecule, comprising:

a) an instruction for generating a sample of one or more probable secondary
structures of an RNA sequence by:

i) generating one or more partition functions of a fragment
having one or more bases of the RNA sequence in accordance
with a predetermined number of thermodynamics parameters,
and

ii) generating secondary structures based on tracebacks using
conditional probabilities computed with the partition
functions;

b) an instruction for estimating a probability that a segment of one or more
bases on the target RNA is single-stranded in accordance with an observed
frequency in the sample, wherein the estimating instruction is repeated for all
segments on the target RNA.

11. A computer program embodied on a computer-readable medium for
determining an antisense oligo of a predetermined length for an antisense nucleation
site on a target RNA, comprising:

a) an instruction for generating a sample of one or more probable secondary
structures of an RNA sequence by:

i) generating one or more partition functions of a fragment
having one or more bases of the RNA sequence in accordance
with a predetermined number of thermodynamics parameters,
and

ii) generating secondary structures based on tracebacks using
conditional probabilities computed with the partition
functions;

b) an instruction for estimating a probability that a segment of one or more
bases on the target RNA is single-stranded by using an observed frequency
in the sample, said estimating instruction being repeated for all segments on
the target RNA;

c) an instruction for identifying a target segment in accordance with the
estimated probabilities;

d) an instruction for determining a base sequence of the target segment; and
e) an instruction for determining the antisense oligo in accordance with the
base sequence.
12. A computer program embodied on a computer-readable medium for
evaluating an antisense oligo for a target RNA, comprising:

a) an instruction for generating a sample of one or more probable secondary
structures of an RNA sequence by:

i) generating one or more partition functions of a fragment
having one or more bases of the RNA sequence in accordance
with a predetermined number of thermodynamics parameters,
and

ii) generating secondary structures based on tracebacks using
conditional probabilities computed with the partition
functions;

b) an instruction for estimating a probability that a segment of one or more
bases on the target RNA is single-stranded in accordance with an observed
frequency in the sample, said estimating instruction being repeated for all
bases on the target RNA;

c) an instruction for calculating a sampling-probability-weighted free energy
for measuring a nucleation potential of a hybridization between the antisense
oligo and the target RNA; and

d) an instruction for generating an evaluation indicator for the antisense oligo
in accordance with the sampling-probability-weighted binding energy and
the estimated probabilities for the target RNA.

13. A process embodied in an instruction signal of a computing device
for generating a sample of a predetermined number of probable secondary structures
of an RNA sequence, comprising:

a) an instruction for generating one or more partition functions of a fragment
having one or more bases of the RNA sequence in accordance with a
predetermined number of thermodynamics parameters; and

b) an instruction for generating secondary structures based on tracebacks
using conditional probabilities computed with the partition functions.
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14. A process embodied in an instruction signal of a computing device 
for generating a probability profile for predicting an accessible site on a target RNA
for interaction with a biomolecule, comprising:

a) an instruction for generating a sample of one or more probable secondary
structures of an RNA sequence by:

i) generating one or more partition functions of a fragment having
one or more bases of the RNA sequence in accordance with a predetermined
number of thermodynamics parameters, and

ii) generating secondary structures based on tracebacks using
conditional probabilities computed with the partition functions;

b) an instruction for estimating a probability that a segment of one or more
bases on the target RNA is single-stranded in accordance with an observed
frequency in the sample, wherein the estimating instruction is repeated for all
segments on the target RNA.

15. A process embodied in an instruction signal of a computing device
for determining an antisense oligo of a predetermined length for an antisense
nucleation site on a target RNA, comprising:

a) an instruction for generating a sample of one or more probable secondary
structures of an RNA sequence by:

i) generating one or more partition functions of a fragment
having one or more bases of the RNA sequence in accordance
with a predetermined number of thermodynamics parameters,
and

ii) generating secondary structures based on tracebacks using
conditional probabilities computed with the partition
functions;

b) an instruction for estimating a probability that a segment of one or more
bases on the target RNA is single-stranded by using an observed frequency
in the sample, said estimating instruction being repeated for all segments on
the target RNA;
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c) an instruction for identifying a target segment in accordance with the 
estimated probabilities;

d) an instruction for determining a base sequence of the target segment; and

e) an instruction for determining the antisense oligo in accordance with the
base sequence.

16. A process embodied in an instruction signal of a computing device
for evaluating an antisense oligo for a target RNA, comprising:

a) an instruction for generating a sample of one or more probable secondary
structures of an RNA sequence by:

i) generating one or more partition functions of a fragment
having one or more bases of the RNA sequence in accordance
with a predetermined number of thermodynamics parameters,
and

ii) generating secondary structures based on tracebacks using
conditional probabilities computed with the partition
functions;

b) an instruction for estimating a probability that a segment of one or more
bases on the target RNA is single-stranded in accordance with an observed
frequency in the sample, said estimating instruction being repeated for all
segments on the target RNA;

c) an instruction for calculating a sampling-probability-weighted free energy
for measuring a nucleation potential of a hybridization between the antisense
oligo and the target RNA; and

d) an instruction for generating an evaluation indicator for the antisense oligo
in accordance with the sampling-probability-weighted free energy and the
estimated probabilities for the target RNA.

17. A method for transmitting information comprising performing a
method as claimed in any one of claims 1-8 and transmitting a result thereof.

18. A method for target prediction or for identification of effective sites 
on target RNA for gene down-regulation, or for identification of single-stranded
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regions in the secondary structure of an mRNA or viral RNA, or for screening or 
designing of antisense oligos or ribozymes, or for performing functional genomics,
or for drug target validation, or for development of antisense therapeutics, or for the 
design of oligonucleotide probes, or for enhancing signals on nucleic acids 
hybridization arrays, or for producing higher quality array data, comprising 
performing a method as claimed in any one of claims 1-8. 

19. A method for transmitting information for or on target prediction or
for or on identification of effective sites on target RNA for gene down-regulation, or
for or on identification of single-stranded regions in the secondary structure of an
mRNA or viral RNA, or for or on screening or designing of antisense oligos or
ribozymes, or for or on performing functional genomics, or for or on drug target
validation, or for or on development of antisense therapeutics, or for or on the design
of oligonucleotide probes, or for or on enhancing signals on nucleic acids
hybridization arrays, or for or on producing higher quality array data, comprising
performing a method as claimed in any one of claims 1-8, and transmitting a result
thereof.

20. The method of claim 19 wherein the transmitting is via email or the
internet.
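
The sampling and profiling steps recited in the claims above (partition functions for every fragment, tracebacks drawn from conditional probabilities, and single-stranded frequencies observed in the sample) can be illustrated with a minimal Python sketch. It is not the claimed implementation: it assumes a toy energy model with a hypothetical fixed energy per base pair (PAIR_ENERGY) in place of the thermodynamics parameters, an assumed minimum hairpin loop of three bases, and an assumed default segment width of four bases. All function and constant names are illustrative.

```python
import math
import random

RT = 0.6163            # kcal/mol at 37 C (assumed)
PAIR_ENERGY = -1.0     # hypothetical energy per base pair, NOT Turner parameters
MIN_LOOP = 3           # assumed minimum number of unpaired bases in a hairpin
PAIR_WEIGHT = math.exp(-PAIR_ENERGY / RT)
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def partition_functions(seq):
    """q[i][j] is the partition function of the fragment seq[i:j]."""
    n = len(seq)
    q = [[1.0] * (n + 1) for _ in range(n + 1)]
    for length in range(1, n + 1):
        for i in range(0, n - length + 1):
            j = i + length
            total = q[i][j - 1]                      # last base unpaired
            for k in range(i, j - 1 - MIN_LOOP):     # last base paired with k
                if (seq[k], seq[j - 1]) in PAIRS:
                    total += q[i][k] * PAIR_WEIGHT * q[k + 1][j - 1]
            q[i][j] = total
    return q


def sample_structure(seq, q, rng):
    """Draw one structure (list of base pairs) by stochastic traceback."""
    pairs = []
    stack = [(0, len(seq))]
    while stack:
        i, j = stack.pop()
        if j - i <= MIN_LOOP + 1:
            continue                                 # too short to hold a pair
        r = rng.random() * q[i][j] - q[i][j - 1]
        if r < 0:                                    # last base left unpaired
            stack.append((i, j - 1))
            continue
        for k in range(i, j - 1 - MIN_LOOP):
            if (seq[k], seq[j - 1]) in PAIRS:
                r -= q[i][k] * PAIR_WEIGHT * q[k + 1][j - 1]
                if r < 0:
                    pairs.append((k, j - 1))
                    stack.append((i, k))
                    stack.append((k + 1, j - 1))
                    break
        else:
            stack.append((i, j - 1))                 # rounding fallback
    return pairs


def single_stranded_profile(seq, n_samples=1000, width=4, seed=0):
    """Observed frequency with which each segment of `width` bases is unpaired."""
    rng = random.Random(seed)
    q = partition_functions(seq)
    n = len(seq)
    counts = [0] * (n - width + 1)
    for _ in range(n_samples):
        paired = [False] * n
        for a, b in sample_structure(seq, q, rng):
            paired[a] = paired[b] = True
        for s in range(n - width + 1):
            if not any(paired[s:s + width]):
                counts[s] += 1
    return [c / n_samples for c in counts]
```

Because each traceback step chooses among mutually exclusive cases whose weights sum to the fragment's partition function, every structure is drawn with its Boltzmann probability under the toy model. Substituting a full loop-based energy model would change the recursions but not this sampling principle.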
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Table 1. Maximum likelihood estimate (MLE) and its standard deviation (SD), and 95% 
confidence interval (CI) for Boltzmann equilibrium probability of a secondary structure for L.
collosoma SL RNA, computed from 1,000,000 independently sampled secondary structures (a)

Structure            Boltzmann Probability   MLE        SD         95% CI

Optimal structure    0.287469                0.287476   0.000453   (0.286588, 0.288363)
Structure 1          0.003598                0.003595   0.000060   (0.003477, 0.003713)
Structure 2          0.018226                0.018219   0.000134   (0.017956, 0.018482)

(a) For any structure with a probability p of being sampled, and for m independently sampled
structures, the MLE of p is p̂ = n/m, where n is the frequency of the structure in the sample. The
standard deviation of this estimate is SD = sqrt(p(1-p)/m), and the 95% CI based on an asymptotic
normal distribution is (p - 1.96 SD - 1/(2m), p + 1.96 SD + 1/(2m)).
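
The estimate, standard deviation and confidence interval defined in the footnote to Table 1 can be computed as in the following sketch. The function name and the default multiplier z = 1.96 are illustrative, and the unknown p in the SD formula is replaced by its estimate, which is an assumption about how the tabulated values were obtained.

```python
import math


def boltzmann_probability_estimate(n, m, z=1.96):
    """Return (MLE, SD, (lower, upper)) for a structure seen n times in m samples."""
    p_hat = n / m                                  # maximum likelihood estimate
    sd = math.sqrt(p_hat * (1.0 - p_hat) / m)      # SD with p replaced by its MLE
    half_width = z * sd + 1.0 / (2.0 * m)          # includes the 1/(2m) correction
    return p_hat, sd, (p_hat - half_width, p_hat + half_width)
```

For example, with n = 287,476 and m = 1,000,000 the function returns values that agree with the "Optimal structure" row of Table 1 to within the last printed digit.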
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Table 2. Comparison of computation times (in seconds) for the calculation of partition functions 
(PFs) and for sampling of 1,000 structures, for a variety of biological sequences (a)

Sequence (GenBank Accession No.)       Length (nts)   PFs          1,000 structures

E. coli tRNA (X66515)                  76             0.30         0.34
Xlo (b) 5S rRNA (K02695)               120            1.13         0.84
E. coli RNase P (V00338)               377            25.90        4.56
Rabbit β-globin mRNA (V00879)          589            94.70        10.69
HSA (c) mRNA (NM_017567.1)             1187           781.83       36.04
BCRP (d) mRNA (AF098951)               2418           6545.19      127.69
E. coli lacZ (U00096)                  3113           14003.81     236.21
E. coli lacZ+lacY (U00096)             4367           39299.98     434.12
MRP (e) mRNA (L05628.1)                5011           59749.96     536.18
ESR1 (f) mRNA (NM_000125)              6450           132752.20    860.27

(a) FORTRAN code of the algorithm was executed on a 667 MHz processor of a Compaq
AlphaStation DS20E running Tru64 UNIX V5.1.
(b) Xenopus laevis oocyte.
(c) Homo sapiens N-acetylglucosamine kinase.
(d) Homo sapiens breast cancer resistance protein.
(e) Human multidrug resistance-associated protein.
(f) Homo sapiens estrogen receptor 1.
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Table 3: Classification, representation and statistical characterization of the Boltzmann 
ensemble of the secondary structures for L. collosoma SL RNA by the examination of a 
statistical sample of 1,000 secondary structures 



Class          Probability   Representative      ΔG°37 (kcal/mol)   Boltzmann     Probability
               (2Dhist)      structure           (% off minimum)    probability   ratio

1A (Fig. 8A)   0.010         Fig. 10A (form 1)   -8 (25.2%)         0.003598      2.78
1B (Fig. 8B)   0.417         Fig. 10B            -10.7 (0%)         0.287469      1.45
1C (Fig. 8C)   0.473         Fig. 10C            -10.1 (5.6%)       0.108593      4.36
2A (Fig. 9A)   0.073         Fig. 11A (form 2)   -9 (15.9%)         0.018226      4.01
2B (Fig. 9B)   0.025         Fig. 11B            -5.7 (65.5%)       0.000086      290.70



A manual examination was first performed for a smaller sample of 100 structures to
identify characteristics of the classes. The characteristics provide input for a computer
classification of the sample. Two structures missing a characteristic helix in form 2 are
not included in class 2. The probability of a class is estimated from the sample. The free
energy is computed with the recent Turner's parameters, and the Boltzmann probability of
a class-representative structure is computed by equation (1). The probability ratio is the
probability of the class divided by the Boltzmann probability.
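
The last two columns of Table 3 can be checked with a short sketch. The probability ratio follows directly from the note above. The second function assumes that equation (1) has the standard Boltzmann form, P = exp(-ΔG/RT)/Z, so that the probability of one structure follows from that of a reference structure and their free energy difference. The gas constant, the temperature of 37 °C and the function names are assumptions made for illustration.

```python
import math

R = 0.0019872    # gas constant in kcal/(mol K) (assumed value)
T = 310.15       # 37 C in kelvin (assumed temperature)


def probability_ratio(class_probability, boltzmann_probability):
    """Class probability divided by the representative's Boltzmann probability."""
    return class_probability / boltzmann_probability


def boltzmann_probability_from_reference(dg, dg_ref, p_ref):
    """Probability of a structure with free energy dg, given a reference
    structure with free energy dg_ref and Boltzmann probability p_ref.
    The partition function Z cancels in the ratio of the two probabilities."""
    return p_ref * math.exp(-(dg - dg_ref) / (R * T))
```

For class 1B, probability_ratio(0.417, 0.287469) gives about 1.45. Taking the class 1B representative (-10.7 kcal/mol, 0.287469) as the reference, the representative of class 1A at -8 kcal/mol comes out near 0.0036, in line with the tabulated 0.003598.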
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Table 4. Probability estimates of structural motifs for cIII mRNA from a sample of 100 


structures

Motif and constraint                                           Probability

AUG initiation codon in a closed region (Fig. 14A)             0.95
AUG initiation codon in a partly open region (Figs. 14B, C)    0.05
At least 4 bases in either end of the Shine-Dalgarno           0.97
sequence are in a helical region (Fig. 14A)
The ends of the Shine-Dalgarno sequence are open               0.03
but the bases in the middle are in a short helix (Fig. 14C)
The first helix from the 5' end with 8 base pairs              0.69
Base pair U-G                                                  0.93
Unpaired C and U (in a hairpin)                                1.0
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Table 5. Cotrespoiiddice betweoi ph^ogoietically deteimiiied angle-stranded regions and peaks on the 
probability profile and improvement in predictions over minimum free energy structure

RNA sequence                     Accession no.   Length (nts)   Pc (%) (a)   Pc1 (%) (b)   Pc2 (%) (c)   Pc3 (%) (d)   Pi (%) (e)

E. coli tRNA                     X66515          76             100          100           100           0             20
Xenopus laevis oocyte 5S rRNA    K02695          120            100          100           100           25            28
E. coli 16S rRNA domain II       J01695          353            82           100           50            33            29
E. coli RNase P                  V00338          377            100          100           58            50            40
Tetrahymena thermophila          V01416          413            95           88            67            29            19
LSU group I intron
(a) Pc is the percentage of phylogenetically determined single-stranded regions (region here is either a
sequence of four consecutive nucleotides or several such sequences in a row) that correspond to peaks
(regardless of the magnitude of the maximum probability) in the probability profile in Fig. 19.

(b) For peaks with a maximum probability ≥ 0.5, Pc1 is the percentage of these peaks that correspond to
single-stranded regions.

(c) Pc2 is the percentage of the correspondence for peaks with a maximum probability between 0.2 and
0.5.

(d) Pc3 is the percentage of the correspondence for peaks with a maximum probability between 0.2
and 0.5.

(e) A probability profile predicts more single-stranded regions in the phylogenetic structure than the
minimum free energy structure (Fig. 18B and Figs. 19A-D). Pi is the percentage of improvement in the
prediction by the probability profile over the minimum free energy (MFE) structure. This is computed by
the number of regions missed by the MFE structure but predicted by the probability profile divided by
the total number of single-stranded regions in the phylogenetic structure (e.g., seven for Xlo 5S RNA),
and multiplied by 100%.
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Table 6. Cono^arison of inlubition of rabbit P-globin synthesis in cell-free translation systems 
and hybridization potential predicted by probability profile for rabbit β-globin mRNA

ASO name (length)       Target sequence / site   Inhibition %          Hybridization
                        on mRNA                  (ASO concentration)   potential

Goodchild et al.
β1                      A / 5' UTR               23% (5.2 µM)          high
β2 (20)                 C–G / start              61% (5.2 µM)          high
β3 (20)                 A–C / coding             18% (5.2 µM)          moderate
β4 (20)                 G–A / coding             43% (5.2 µM)          high
β5 (22)                 A–G / cap                67% (5.2 µM)          high
β6 (23)                 U–A / 5' UTR             47% (5.2 µM)          high
β7 (CCC+β5, 25)         A–G / cap                75% (5.2 µM)          high
β8 (β7β6, 48)           A–A / cap                89% (2.6 µM)          high
β6+β7 (mixture)         A–A / cap                89% (2.6 µM)          high

Milner et al.
BG1 (17)                C–U / start              50% (0.1 µM)          high
BG2 (17)                A–C / start              50% (0.5 µM)          high
BG3 (15)                coding                   0% (1 µM)             low

Cazenave et al.
17 Glo [3-19] (17)      A–A / cap                72% (0.5 µM)          high
17 Glo [51-67] (17)     U–C / start              95% (0.5 µM)          high
11 Glo (11)             A–A / start              65% (0.5 µM)          high
17 Glo [113-129] (17)   U–G / coding             95% (0.5 µM)          low
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Table 7. Comparison of the intensity of ASO:mRNA hybridization on the oligodeoxynucleotide array 
and the probability profile for the first 122 bases of rabbit β-globin mRNA

Region (a)   Hybridization intensity   Probability profile (peak feature)

             not detectable            high peaks (narrow)
             high                      high peak (wide)
A–C          weak but detectable       low
A–A          not detectable            low
             moderate                  moderate

(a) The region ending in C is contained in two 16-mers (C–A and A–C) and three 17-mers (one of
which is BG1); hybridization yields for ASOs complementary to these six sequences are at
least three times that of any other oligonucleotides in the array by Milner et al.



FIG. 23

WO 03/065281 PCT/US03/02644

[Figure: x-axis "Nucleotide Position" (0 to 230); site labels include Glo [51-67], P6 and Glo [113-129]]

FIG. 24

FIG. 25

Table 8. Rationally designed antisense oligos (20-mers) targeted to E. coli lacZ mRNA 



Oligo ID   mRNA position   Oligo (5' -> 3')        Binding energy (kcal/mol)

1          24-43           GTCATAGCTGTTTCCTGTGT    -17.8366
2          92-111          GTTGGGTAACGCCAGGGTTT    -14.3687
3          226-245         CGCTTCTGGTGCCGGAAACC    -9.1381
4          548-567         ATGCGCTCAGGTCAAATTCA    -9.2021
5          651-670         CGGAAAATGCCGCTCATCCG    -8.2718
6          948-967         TAGAGATTCGGGATTTCGGC    -16.8185
7          1172-1191       AGTTGTTCTGCTTCATCAGC    -14.0105
8          1281-1300       ATGCCGTGGGTTTCAATATT    -15.1901
9          1561-1580       CGGGAAGGGCTGGTCTTCAT    -10.8118
10         2214-2233       GGGAGCGTCACACTGAGGTT    -14.3497
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Table 9. Rationally designed siRNAs to target AA(N19) motif in exon 3 of ESRl mRNA 
siRNA1:

Target positions on mRNA:               1009-1029
GC content of the target sequence:      47.62%
Probability-weighted binding energy:    -21.74 kcal/mol
Target sequence:                        AACGACUAUAUGUGUCCAGCC
Sense strand siRNA:                     CGACUAUAUGUGUCCAGCCUU
Antisense strand siRNA:                 GGCUGGACACAUAUAGUCGUU

siRNA2:

Target positions on mRNA:               1033-1053
GC content of the target sequence:      38.10%
Probability-weighted binding energy:    -20.16 kcal/mol
Target sequence:                        AACCAGUGCACCAUUGAUAAA
Sense strand siRNA:                     CCAGUGCACCAUUGAUAAAUU
Antisense strand siRNA:                 UUUAUCAAUGGUGCACUGGUU

siRNA3:

Target positions on mRNA:               1090-1110
GC content of the target sequence:      42.86%
Probability-weighted binding energy:    -15.36 kcal/mol
Target sequence:                        AAAUGCUACGAAGUGGGAAUG
Sense strand siRNA:                     AUGCUACGAAGUGGGAAUGUU
Antisense strand siRNA:                 CAUUCCCACUUCGUAGCAUUU

(a) UU at the 3' ends of sense and antisense siRNA can be replaced by dTdT.
(b) Stronger antisense binding, and thus higher siRNA potency, is predicted for lower
probability-weighted binding energy.
(c) Target sequence, sense and antisense siRNAs are all in 5' -> 3' direction.
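
The relationship between each target sequence and its sense and antisense strands in Table 9 can be sketched as follows. The function name and the input checks are illustrative assumptions. The sketch covers only the strand construction and GC content; the probability-weighted binding energy requires the structure sample and is not computed here.

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def design_sirna(target):
    """Given a 21-nt AA(N19) target (RNA, 5'->3'), return
    (sense strand, antisense strand, GC content of the target in percent)."""
    if len(target) != 21 or not target.startswith("AA"):
        raise ValueError("target must be a 21-nt sequence of the form AA(N19)")
    core = target[2:]                                   # the N19 portion
    sense = core + "UU"                                 # N19 plus 3' UU overhang
    reverse_complement = "".join(COMPLEMENT[b] for b in reversed(core))
    antisense = reverse_complement + "UU"               # also written 5'->3'
    gc_percent = 100.0 * sum(b in "GC" for b in target) / len(target)
    return sense, antisense, gc_percent
```

Applied to the three target sequences listed in Table 9, the sketch reproduces the tabulated sense strands, antisense strands and GC contents.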
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